CN114298211A - Feature binning method and device, electronic equipment and storage medium - Google Patents

Feature binning method and device, electronic equipment and storage medium Download PDF

Info

Publication number
CN114298211A
Authority
CN
China
Prior art keywords
box
binning
feature
result
characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111608427.6A
Other languages
Chinese (zh)
Inventor
艾森阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Welab Information Technology Shenzhen Ltd
Original Assignee
Welab Information Technology Shenzhen Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Welab Information Technology Shenzhen Ltd filed Critical Welab Information Technology Shenzhen Ltd
Priority to CN202111608427.6A priority Critical patent/CN114298211A/en
Publication of CN114298211A publication Critical patent/CN114298211A/en
Pending legal-status Critical Current

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention relates to the field of data processing and discloses a feature binning method comprising the following steps: performing a binning operation on each first feature in a first sample set using a first binning method and a first binning quantity to obtain a first binning result; receiving, from a second participant, the sum of the label ciphertexts of the samples in each second bin after each second feature is binned, where the sum is computed after the second participant performs a binning operation on each second feature in a second sample set using a second binning method and a second binning quantity; and, when the first and second binning results are judged to be reasonable, selecting first and second target features, encoding the first target feature to obtain binned data, and sending the second target feature to the second participant so that the second participant encodes it to complete binning. The invention also provides a feature binning apparatus, an electronic device and a storage medium. The invention increases binning flexibility and improves the reasonableness of the binning results.

Description

Feature binning method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of data processing, and in particular to a feature binning method and apparatus, an electronic device, and a storage medium.
Background
As public demand for data security grows, federated learning has attracted increasing attention. It combines cryptography and artificial intelligence so that the parties to a joint modeling task can build a model over multi-party data under the protection of cryptographic techniques, without revealing their own raw data.
During joint modeling, each feature of the data needs to be binned; the purpose of binning is to discretize the feature so that the resulting model is more robust. When multiple parties model jointly, different features can normally only use the same binning method, which lacks flexibility and can lead to unreasonable binning results, so the accuracy of the jointly built model suffers. A feature binning method is therefore needed that improves binning flexibility and ensures reasonable binning results.
Disclosure of Invention
In view of the above, there is a need to provide a feature binning method that improves binning flexibility and ensures the reasonableness of the binning results.
The invention provides a feature binning method applied to a first participant in communication with a second participant, where the first participant and the second participant hold the same sample objects but different sample features. The method comprises the following steps:
encrypting the label of each sample in a locally stored first sample set with a homomorphic encryption key to obtain a label ciphertext for each sample, establishing a mapping relation between the label ciphertexts and the sample IDs, and sending the mapping relation to the second participant;
performing a binning operation on each first feature to be binned in the first sample set using a first binning method and a first binning quantity to obtain a first binning result corresponding to each first feature, and calculating, based on the first binning result, a first number of each label in each first bin after each first feature is binned;
receiving the sum of the label ciphertexts of the samples in each second bin after each second feature is binned, where the sum of the label ciphertexts is computed by the second participant based on the mapping relation and a second binning result, and the second binning result is obtained by the second participant performing a binning operation on each second feature to be binned in its locally stored second sample set using a second binning method and a second binning quantity;
and judging, based on the first numbers and the sums of the label ciphertexts, whether the first binning results and the second binning results are reasonable; if so, selecting a first target feature and a second target feature to be used for modeling from the first features and the second features respectively, encoding the first target feature to obtain binned data, and sending the second target feature to the second participant so that the second participant encodes it to complete binning.
Optionally, the judging whether the first binning results and the second binning results are reasonable based on the first numbers and the sums of the label ciphertexts includes:
determining, based on the sums of the label ciphertexts, a second number of each label in each second bin after each second feature is binned;
encoding the features in each first bin based on the first numbers to obtain a first encoded value for each first bin, and judging, based on the first encoded values, whether the corresponding first binning result is reasonable;
and encoding the features in each second bin based on the second numbers to obtain a second encoded value for each second bin, and judging, based on the second encoded values, whether the corresponding second binning result is reasonable.
Optionally, the determining, based on the sums of the label ciphertexts, a second number of each label in each second bin after each second feature is binned includes:
receiving, from the second participant, the sum of ciphertext differences for each second bin after each second feature is binned, where the sum of ciphertext differences is the sum, over the samples in the corresponding second bin, of the difference between a preset value and each sample's label ciphertext;
determining a second number of first labels in each second bin based on the sum of the label ciphertexts;
and determining a second number of second labels in each second bin based on the sum of the ciphertext differences.
Optionally, the judging whether the corresponding first binning result is reasonable based on the first encoded values includes:
if the first encoded values of the first bins corresponding to a given first feature are monotonic, the first binning result corresponding to that first feature is reasonable.
Optionally, after the judging whether the first binning results and the second binning results are reasonable, the method further includes:
if a first binning result is judged to be unreasonable, adjusting that first binning result and performing the reasonableness judgment on the adjusted result;
and if a second binning result is judged to be unreasonable, sending early-warning information to the second participant to prompt the second participant to adjust that second binning result and perform the reasonableness judgment on the adjusted result.
Optionally, the selecting a first target feature and a second target feature to be used for modeling from the first features and the second features respectively includes:
calculating a first information value for each first feature based on the first encoded values;
calculating a second information value for each second feature based on the second encoded values;
and taking a first feature whose first information value lies in a preset value interval as a first target feature, and taking a second feature whose second information value lies in the preset value interval as a second target feature.
Optionally, the calculation formula of the first encoded value is:
WOE_{a-i} = ln( (Q_{a-i} / Q_T) / (P_{a-i} / P_T) )
where WOE_{a-i} is the first encoded value of the i-th first bin after the a-th first feature is binned, Q_{a-i} is the first number of second labels in the i-th first bin after the a-th first feature is binned, P_{a-i} is the first number of first labels in the i-th first bin after the a-th first feature is binned, Q_T is the total number of second labels in the first sample set, and P_T is the total number of first labels in the first sample set.
In order to solve the above problems, the present invention also provides a feature binning apparatus, comprising:
an encryption module, configured to encrypt the label of each sample in a locally stored first sample set with a homomorphic encryption key to obtain a label ciphertext for each sample, establish a mapping relation between the label ciphertexts and the sample IDs, and send the mapping relation to a second participant;
a binning module, configured to perform a binning operation on each first feature to be binned in the first sample set using a first binning method and a first binning quantity to obtain a first binning result corresponding to each first feature, and to calculate, based on the first binning result, a first number of each label in each first bin after each first feature is binned;
a receiving module, configured to receive the sum of the label ciphertexts of the samples in each second bin after each second feature is binned, where the sum of the label ciphertexts is computed by the second participant based on the mapping relation and a second binning result, and the second binning result is obtained by the second participant performing a binning operation on each second feature to be binned in its locally stored second sample set using a second binning method and a second binning quantity;
and a judging module, configured to judge, based on the first numbers and the sums of the label ciphertexts, whether the first binning results and the second binning results are reasonable, and, if so, to select a first target feature and a second target feature to be used for modeling from the first features and the second features respectively, encode the first target feature to obtain binned data, and send the second target feature to the second participant so that the second participant encodes it to complete binning.
In order to solve the above problem, the present invention also provides an electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor, wherein
the memory stores a feature binning program executable by the at least one processor to enable the at least one processor to perform the feature binning method described above.
In order to solve the above problem, the present invention also provides a computer-readable storage medium having stored thereon a feature binning program which is executable by one or more processors to implement the above feature binning method.
Compared with the prior art, the invention first performs a binning operation on each first feature in a first sample set using a first binning method and a first binning quantity to obtain a first binning result corresponding to each first feature; it then receives the sum of the label ciphertexts of the samples in each second bin after each second feature is binned, where the sum is computed based on a second binning result obtained by the second participant performing a binning operation on each second feature in its second sample set using a second binning method and a second binning quantity; finally, it judges whether the first and second binning results are reasonable, and if so, selects the first and second target features, encodes the first target feature to obtain binned data, and sends the second target feature to the second participant so that the second participant encodes it to complete binning. With the invention, each participant can choose one or more binning methods and the corresponding binning quantities according to the characteristics of its own sample set, which increases binning flexibility and improves the reasonableness of the binning results.
Drawings
FIG. 1 is a schematic flow chart of a feature binning method according to an embodiment of the present invention;
FIG. 2 is a block diagram of a feature binning apparatus according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an electronic device implementing a feature binning method according to an embodiment of the present invention;
the implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that references to "first", "second", etc. in the present invention are for descriptive purposes only and are not to be construed as indicating or implying relative importance or the number of technical features indicated. Thus, a feature qualified as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments may be combined with one another, provided such combinations can be realized by a person skilled in the art; where combined solutions are contradictory or cannot be realized, the combination should be considered not to exist and falls outside the protection scope of the present invention.
The invention provides a feature binning method applied to a first participant in communication with a second participant. Fig. 1 is a schematic flow chart of a feature binning method according to an embodiment of the present invention. The method may be performed by an electronic device (corresponding to the first participant), which may be implemented by software and/or hardware.
In this embodiment, the first participant and the second participant hold the same sample objects but different sample features, and the feature binning method includes:
s1, encrypting the label of each sample in the first sample set stored locally by adopting a homomorphic encryption key to obtain a label ciphertext of each sample, establishing a mapping relation between the label ciphertext and the sample ID, and sending the mapping relation to the second participant.
In this embodiment, the first participant and the second participant hold the same sample objects but different sample features; the first sample set locally stored by the first participant carries labels, while the second sample set locally stored by the second participant does not.
For example, if the first participant is a bank and the second participant is a shopping platform, the first sample set held by the bank contains sample data for sample IDs 1-1000, including each user's number of deposits, deposit amount, number of loans and loan amount over the past six months; the second sample set held by the shopping platform contains sample data for sample IDs 1-1000, including each user's number of purchases, purchase amount and purchase category over the past six months.
The sample ID may be a user ID, and the user ID may be an identification number, a mobile phone number, an employee number or other information identifying the user. The labels carried by the bank's first sample set may include a first label and a second label, where the first label is denoted 1 and the second label is denoted 0; the first label indicates that the user purchased a specified financial product and the second label indicates that the user did not, so in the first sample set the samples labeled 1 are positive samples and the samples labeled 0 are negative samples.
In this embodiment, a homomorphic encryption key is used to encrypt the labels. A homomorphic encryption algorithm has the following property: if homomorphically encrypted data are operated on with a given formula and the result is decrypted, the decrypted value is the same as the result of applying the same formula to the unencrypted original data. Using a homomorphic encryption algorithm therefore allows the data to be operated on normally without exposing the original data.
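As a minimal illustration of this step, the labels could be encrypted with an additively homomorphic scheme such as Paillier. The sketch below assumes the open-source python-paillier package (phe) and hypothetical sample data; it is illustrative only, not the required implementation of the claims.

# Sketch of step S1: encrypt labels with Paillier and build the ID-to-ciphertext mapping.
# Assumes the "phe" (python-paillier) package; the sample data are hypothetical.
from phe import paillier

# First participant's local samples: sample ID -> label (1 = first label, 0 = second label)
first_sample_labels = {1: 1, 2: 0, 3: 1, 4: 0, 5: 1}

public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

# Each encryption uses fresh randomness, so equal labels yield different ciphertexts.
id_to_label_ciphertext = {
    sample_id: public_key.encrypt(label)
    for sample_id, label in first_sample_labels.items()
}
# The mapping {sample ID: label ciphertext} is what is sent to the second participant.

Because the scheme is additively homomorphic, the second participant can later sum these ciphertexts per bin without learning any individual label.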
S2, performing a binning operation on each first feature to be binned in the first sample set by adopting a first binning method and the first binning quantity to obtain a first binning result corresponding to each first feature, and calculating the first quantity of each label in each first bin after each first feature is binned based on the first binning result.
The features to be binned are usually continuous features or discrete features with a large number of values.
In this embodiment, a plurality of binning methods are provided, and each participant may, according to actual conditions, use the same binning method or different binning methods for the different features of its sample set (each feature is binned independently, and the binning of one feature does not interfere with that of another); once a binning method is chosen, the participant also determines the corresponding binning quantity.
The binning methods include equal-frequency, equal-width (equidistant), optimal and user-defined binning. Equal-frequency binning places the same number of samples in each bin; equal-width binning makes the value range covered by each bin the same; optimal binning splits recursively and determines the split points according to a statistical test; user-defined binning lets the user specify the split points.
For example, the first binning method selected by the bank for the number of deposits and the number of loans in the first sample set is equal-frequency binning, with a corresponding first binning quantity of 5 bins; the first binning method selected for the deposit amount and the loan amount is equal-width binning, with a corresponding first binning quantity of 10 bins.
After the first binning result for each first feature is obtained, the first number of each label in each first bin is calculated; for example, for the number of deposits, the (first-label, second-label) counts in the 5 first bins are (30,20), (23,27), (36,14), (18,32) and (28,22) respectively.
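The binning and per-bin label counting of step S2 can be sketched as follows (assuming pandas and NumPy; the column names and generated data are hypothetical and only illustrate equal-frequency and equal-width binning):

# Sketch of step S2: bin two first features and count labels per bin.
# Assumes pandas and NumPy; the data below are hypothetical.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
first_samples = pd.DataFrame({
    "sample_id": np.arange(1, 1001),
    "deposit_count": rng.integers(0, 50, size=1000),
    "deposit_amount": rng.uniform(0, 100000, size=1000),
    "label": rng.integers(0, 2, size=1000),   # 1 = first label, 0 = second label
})

# Equal-frequency binning: roughly the same number of samples in each bin.
first_samples["deposit_count_bin"] = pd.qcut(
    first_samples["deposit_count"], q=5, duplicates="drop")

# Equal-width binning: each bin covers a value range of the same width.
first_samples["deposit_amount_bin"] = pd.cut(
    first_samples["deposit_amount"], bins=10)

# First number of each label in each first bin of the deposit-count feature.
label_counts = (first_samples
                .groupby("deposit_count_bin", observed=True)["label"]
                .value_counts()
                .unstack(fill_value=0))
print(label_counts)   # rows: first bins; columns: counts of label 0 and label 1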
S3, receiving the sum of the label ciphertexts of the samples in each second bin after each second feature is binned, where the sum of the label ciphertexts is computed by the second participant based on the mapping relation and the second binning result, and the second binning result is obtained by the second participant performing a binning operation on each second feature to be binned in its locally stored second sample set using a second binning method and a second binning quantity.
The second participant selects a second binning method and a second binning quantity for each second feature to be binned in its locally stored second sample set; for example, the second binning method selected by the shopping platform for the number of purchases is equal-width binning with a corresponding second binning quantity of 6, and the second binning method selected for the purchase amount is user-defined binning with a corresponding second binning quantity of 15.
The second sample set held by the second participant contains no label information, so the label ciphertexts of the first sample set are needed to count the numbers of first labels and second labels in each bin of each second feature.
The label ciphertexts are produced with a homomorphic encryption key, and a fresh random number is generated for each encryption, so encrypting the same label value yields different ciphertexts; for example, the ciphertext of the first label of the 1st sample in the first sample set might be 5 while the ciphertext of the first label of the 3rd sample might be 12. The second participant therefore cannot directly count how many ciphertexts correspond to the first label and how many to the second label.
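This property can be illustrated with a small sketch (again assuming the phe package; purely illustrative):

# Sketch: randomized homomorphic encryption hides repeated label values,
# but sums of ciphertexts still decrypt correctly. Assumes the "phe" package.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()
c1 = public_key.encrypt(1)   # first label of one sample
c2 = public_key.encrypt(1)   # first label of another sample
print(c1.ciphertext() == c2.ciphertext())   # False: the same label encrypts to different ciphertexts
print(private_key.decrypt(c1 + c2))         # 2: yet the homomorphic sum still decrypts correctly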
S4, judging, based on the first numbers and the sums of the label ciphertexts, whether the first binning results and the second binning results are reasonable; if so, selecting a first target feature and a second target feature to be used for modeling from the first features and the second features respectively, encoding the first target feature to obtain binned data, and sending the second target feature to the second participant so that the second participant encodes it to complete binning.
In this embodiment, whether a binning result is reasonable is determined with the WOE (weight of evidence) indicator, the target features to be used for modeling are selected with the IV (information value) indicator, and each participant then applies WOE encoding to its own target features to obtain the binned data, thereby completing the binning.
The judging whether the first binning results and the second binning results are reasonable based on the first numbers and the sums of the label ciphertexts includes steps A11-A13:
A11, determining, based on the sums of the label ciphertexts, a second number of each label in each second bin after each second feature is binned;
The determining, based on the sums of the label ciphertexts, a second number of each label in each second bin after each second feature is binned includes steps B11-B13:
B11, receiving, from the second participant, the sum of ciphertext differences for each second bin after each second feature is binned, where the sum of ciphertext differences is the sum, over the samples in the corresponding second bin, of the difference between a preset value and each sample's label ciphertext;
the second party not only sends the sum of the tag ciphertexts but also sends the sum of the cipher text differences to the first party, in this embodiment, the preset value is 1, that is, the sum of the tag ciphertexts is
Figure BDA0003432375760000071
The sum of the ciphertext differences is
Figure BDA0003432375760000072
Wherein S isa-iThe sum of the label ciphertexts corresponding to the ith second box after the ith second characteristic is subjected to the sub-box [ ya-ij]A label ciphertext of a jth sample in an ith second box after the ith second characteristic is subjected to the box separation, n is the total number of samples in the ith second box after the ith second characteristic is subjected to the box separation, and Ca-iAnd the ciphertext difference value is the sum of the ciphertext difference values corresponding to the ith second box after the ith second characteristic is subjected to box separation.
B12, determining a second number of first labels in each second bin based on the sum of the label ciphertexts;
The second number of first labels can be obtained by decrypting the sum of the label ciphertexts with the homomorphic encryption key, for the following reason: decrypting a homomorphic sum of encrypted data yields the same value as the sum of the original data, so decrypting the sum of the label ciphertexts yields the sum of the labels, and since the first label is 1, that sum equals the second number of first labels.
B13, determining a second number of second labels in each second bin based on the sum of the ciphertext differences.
The sum of the ciphertext differences is
C_{a-i} = Σ_{j=1}^{n} (1 - [y_{a-ij}])
and its decrypted value equals Σ_{j=1}^{n} (1 - y_{a-ij}); since (1 - y_{a-ij}) is 1 exactly when the j-th sample carries the second label, this sum is the second number of second labels, which is therefore obtained from the sum of the ciphertext differences.
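Steps B11-B13 can be sketched end to end as follows (assuming the phe package; the sample IDs, bin assignments and labels are hypothetical, and the sketch is not a normative implementation of the claims):

# Sketch of B11-B13: the second participant sums label ciphertexts and ciphertext
# differences per second bin; the first participant decrypts them into label counts.
from phe import paillier

# Key pair held by the first participant.
public_key, private_key = paillier.generate_paillier_keypair()

# First participant: label ciphertexts keyed by sample ID (the mapping sent in step S1).
true_labels = {1: 1, 2: 0, 3: 1, 4: 1, 5: 0, 6: 0}
id_to_ct = {sid: public_key.encrypt(y) for sid, y in true_labels.items()}

# Second participant: its own binning of one second feature, as sample IDs per second bin.
second_bins = {"bin_1": [1, 2, 3], "bin_2": [4, 5, 6]}

# Second participant computes, per bin and without decrypting, the sum of label
# ciphertexts S_{a-i} and the sum of ciphertext differences C_{a-i} (preset value = 1).
sums_per_bin = {}
for bin_name, ids in second_bins.items():
    s = public_key.encrypt(0)
    for sid in ids:
        s = s + id_to_ct[sid]        # S_{a-i}: homomorphic sum of label ciphertexts
    c = (s * -1) + len(ids)          # C_{a-i} = n - S_{a-i} = sum over the bin of (1 - [y])
    sums_per_bin[bin_name] = (s, c)

# First participant decrypts the two sums to obtain the per-bin label counts.
for bin_name, (s, c) in sums_per_bin.items():
    first_label_count = private_key.decrypt(s)    # second number of first labels
    second_label_count = private_key.decrypt(c)   # second number of second labels
    print(bin_name, first_label_count, second_label_count)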
A12, encoding the features in each first bin based on the first numbers to obtain a first encoded value for each first bin, and judging, based on the first encoded values, whether the corresponding first binning result is reasonable;
In this embodiment, WOE encoding is applied to the features in each first bin to obtain the first encoded value of each first bin.
The calculation formula of the first encoding value is as follows:
WOE_{a-i} = ln( (Q_{a-i} / Q_T) / (P_{a-i} / P_T) )
where WOE_{a-i} is the first encoded value of the i-th first bin after the a-th first feature is binned, Q_{a-i} is the first number of second labels in the i-th first bin after the a-th first feature is binned, P_{a-i} is the first number of first labels in the i-th first bin after the a-th first feature is binned, Q_T is the total number of second labels in the first sample set, and P_T is the total number of first labels in the first sample set.
The judging whether the corresponding first binning result is reasonable based on the first encoded values includes:
if the first encoded values of the first bins corresponding to a given first feature are monotonic, the first binning result corresponding to that first feature is reasonable.
Monotonic means monotonically increasing or monotonically decreasing. For example, if the number of loans is divided into 5 bins and the first encoded values are 0.2, 0.4, 0.45, 0.5 and 0.53 respectively, the trend is increasing, so the binning result for the number of loans is reasonable.
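The WOE computation and the monotonicity check can be sketched in plain Python (the per-bin counts reuse the hypothetical deposit-count example above; the helper names are illustrative):

# Sketch: compute the WOE of each first bin and check monotonicity.
import math

bin_counts = [(30, 20), (23, 27), (36, 14), (18, 32), (28, 22)]   # (P_{a-i}, Q_{a-i}) per first bin
P_T = sum(p for p, q in bin_counts)   # total number of first labels in the first sample set
Q_T = sum(q for p, q in bin_counts)   # total number of second labels in the first sample set

def woe(p, q):
    # WOE_{a-i} = ln((Q_{a-i}/Q_T) / (P_{a-i}/P_T))
    return math.log((q / Q_T) / (p / P_T))

woe_values = [woe(p, q) for p, q in bin_counts]

def is_monotonic(values):
    increasing = all(a <= b for a, b in zip(values, values[1:]))
    decreasing = all(a >= b for a, b in zip(values, values[1:]))
    return increasing or decreasing

print(woe_values)
# A non-monotonic result would be judged unreasonable and trigger the adjustment in C11 below.
print("first binning result reasonable:", is_monotonic(woe_values))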
A13, encoding the features in each second bin based on the second numbers to obtain a second encoded value for each second bin, and judging, based on the second encoded values, whether the corresponding second binning result is reasonable.
The second encoded values are calculated in the same way as the first encoded values, and whether a second binning result is reasonable is judged in the same way as for a first binning result.
After the judging whether the first binning results and the second binning results are reasonable, the method further includes:
C11, if a first binning result is judged to be unreasonable, adjusting that first binning result and performing the reasonableness judgment on the adjusted result;
In this embodiment, if a first binning result is not reasonable, some first bins are merged using the user-defined binning method so that the adjusted WOE encoded values of the first bins become monotonic.
C12, if a second binning result is judged to be unreasonable, sending early-warning information to the second participant to prompt the second participant to adjust that second binning result and perform the reasonableness judgment on the adjusted result.
The early-warning information includes the second encoded values of the second bins of the corresponding second binning result.
The selecting a first target feature and a second target feature to be used for modeling from the first features and the second features respectively includes:
D11, calculating a first information value for each first feature based on the first encoded values;
the calculation formula of the first information quantity value is as follows:
Figure BDA0003432375760000091
wherein IVaA first information quantity value, WOE, corresponding to the a-th first featurea-iA first coding value Q corresponding to the ith first box after the ith first characteristic is subjected to box separationa-iA first number, P, of second labels in the ith first box after the a-th first characteristic is separateda-iA first number, Q, of first labels in an ith first box binned for an a-th first characteristicTIs the total number of second tags, P, in the first set of samplesTN is the total number of the first labels in the first sample set, and is the first bin number corresponding to the a-th first feature.
D12, calculating a second information value for each second feature based on the second encoded values;
The second information value is calculated with the same formula as the first information value.
D13, taking a first feature whose first information value lies in a preset value interval as a first target feature, and taking a second feature whose second information value lies in the preset value interval as a second target feature.
The information value measures the importance of a feature; in this embodiment the preset value interval may be 0.1-0.5. If a feature's information value is not in this interval, the feature contributes little to modeling and can be dropped.
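Steps D11-D13 can be sketched as follows (plain Python; the per-bin counts and the 0.1-0.5 interval follow the example above, and the feature names are illustrative):

# Sketch: compute the information value (IV) of each feature from its per-bin label counts
# and keep it as a target feature only if the IV falls in the preset interval.
import math

def iv_from_counts(bin_counts):
    # bin_counts: list of (first-label count P_{a-i}, second-label count Q_{a-i}) per bin.
    P_T = sum(p for p, _ in bin_counts)
    Q_T = sum(q for _, q in bin_counts)
    iv = 0.0
    for p, q in bin_counts:
        woe = math.log((q / Q_T) / (p / P_T))
        iv += (q / Q_T - p / P_T) * woe
    return iv

features = {
    "deposit_count": [(30, 20), (23, 27), (36, 14), (18, 32), (28, 22)],
    "loan_count":    [(60, 10), (40, 25), (20, 40), (10, 45), (5, 45)],
}

preset_interval = (0.1, 0.5)
target_features = [
    name for name, counts in features.items()
    if preset_interval[0] <= iv_from_counts(counts) <= preset_interval[1]
]
print(target_features)   # features whose IV lies in the preset interval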
As can be seen from the foregoing embodiments, the feature binning method provided by the present invention first performs a binning operation on each first feature in the first sample set using a first binning method and a first binning quantity to obtain a first binning result corresponding to each first feature; it then receives the sum of the label ciphertexts of the samples in each second bin after each second feature is binned, where the sum is computed based on a second binning result obtained by the second participant performing a binning operation on each second feature in the second sample set using a second binning method and a second binning quantity; finally, it judges whether the first and second binning results are reasonable, and if so, selects the first and second target features, encodes the first target feature to obtain binned data, and sends the second target feature to the second participant so that the second participant encodes it to complete binning. With the invention, each participant can choose one or more binning methods and the corresponding binning quantities according to the characteristics of its own sample set, which increases binning flexibility and improves the reasonableness of the binning results.
Fig. 2 is a schematic block diagram of a feature binning apparatus according to an embodiment of the present invention.
The feature binning apparatus 100 of the present invention may be installed in an electronic device. According to the functions implemented, the feature binning apparatus 100 may include an encryption module 110, a binning module 120, a receiving module 130 and a judging module 140. A module of the present invention, which may also be referred to as a unit, is a series of computer program segments stored in the memory of an electronic device that can be executed by the processor of the electronic device and perform a fixed function.
In this embodiment, the functions of the modules/units are as follows:
the encryption module 110 is configured to encrypt the tag of each sample in the first sample set stored locally by using a homomorphic encryption key to obtain a tag ciphertext of each sample, establish a mapping relationship between the tag ciphertext and the sample ID, and send the mapping relationship to the second party.
The binning module 120 is configured to perform binning operation on each first feature to be binned in the first sample set by using a first binning method and a first binning number to obtain a first binning result corresponding to each first feature, and calculate, based on the first binning result, a first number of each label in each first box after each first feature is binned.
The receiving module 130 is configured to receive a sum of the tag ciphertexts of each sample in each second box after each second feature is binned, where the sum of the tag ciphertexts is obtained by the second party through calculation based on the mapping relationship and a second binning result, and the second binning result is obtained by the second party performing binning operation on each second feature to be binned in a second sample set locally stored by the second party by using a second binning method and a second binning number.
The judging module 140 is configured to judge, based on the first numbers and the sums of the label ciphertexts, whether the first binning results and the second binning results are reasonable, and, when they are, to select a first target feature and a second target feature to be used for modeling from the first features and the second features respectively, encode the first target feature to obtain binned data, and send the second target feature to the second participant so that the second participant encodes it to complete binning.
The judging whether the first binning results and the second binning results are reasonable based on the first numbers and the sums of the label ciphertexts includes steps A21-A23:
A21, determining, based on the sums of the label ciphertexts, a second number of each label in each second bin after each second feature is binned;
The determining, based on the sums of the label ciphertexts, a second number of each label in each second bin after each second feature is binned includes steps B21-B23:
B21, receiving, from the second participant, the sum of ciphertext differences for each second bin after each second feature is binned, where the sum of ciphertext differences is the sum, over the samples in the corresponding second bin, of the difference between a preset value and each sample's label ciphertext;
B22, determining a second number of first labels in each second bin based on the sum of the label ciphertexts;
B23, determining a second number of second labels in each second bin based on the sum of the ciphertext differences.
A22, encoding the features in each first bin based on the first numbers to obtain a first encoded value for each first bin, and judging, based on the first encoded values, whether the corresponding first binning result is reasonable;
the calculation formula of the first encoding value is as follows:
WOE_{a-i} = ln( (Q_{a-i} / Q_T) / (P_{a-i} / P_T) )
where WOE_{a-i} is the first encoded value of the i-th first bin after the a-th first feature is binned, Q_{a-i} is the first number of second labels in the i-th first bin after the a-th first feature is binned, P_{a-i} is the first number of first labels in the i-th first bin after the a-th first feature is binned, Q_T is the total number of second labels in the first sample set, and P_T is the total number of first labels in the first sample set.
The judging whether the corresponding first binning result is reasonable based on the first encoded values includes:
if the first encoded values of the first bins corresponding to a given first feature are monotonic, the first binning result corresponding to that first feature is reasonable.
A23, encoding the features in each second bin based on the second numbers to obtain a second encoded value for each second bin, and judging, based on the second encoded values, whether the corresponding second binning result is reasonable.
After judging whether the first binning results and the second binning results are reasonable, the judging module 140 is further configured to:
C21, if a first binning result is judged to be unreasonable, adjust that first binning result and perform the reasonableness judgment on the adjusted result;
C22, if a second binning result is judged to be unreasonable, send early-warning information to the second participant to prompt the second participant to adjust that second binning result and perform the reasonableness judgment on the adjusted result.
The selecting a first target feature and a second target feature to be used for modeling from the first features and the second features respectively includes:
D21, calculating a first information value for each first feature based on the first encoded values;
D22, calculating a second information value for each second feature based on the second encoded values;
D23, taking a first feature whose first information value lies in a preset value interval as a first target feature, and taking a second feature whose second information value lies in the preset value interval as a second target feature.
Fig. 3 is a schematic structural diagram of an electronic device for implementing a feature binning method according to an embodiment of the present invention.
The electronic device 1 is a device capable of automatically performing numerical calculation and/or information processing according to instructions set or stored in advance. The electronic device 1 may be a computer, a single network server, a server group composed of a plurality of network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing, where cloud computing is a form of distributed computing in which a super virtual computer is formed by a group of loosely coupled computers.
In the present embodiment, the electronic device 1 includes, but is not limited to, a memory 11, a processor 12, and a network interface 13, which are communicatively connected to each other through a system bus, wherein the memory 11 stores a feature binning program 10, and the feature binning program 10 is executable by the processor 12. While fig. 3 shows only the electronic device 1 with components 11-13 and the feature binning program 10, it will be understood by those skilled in the art that the structure shown in fig. 3 does not constitute a limitation of the electronic device 1 and may comprise fewer or more components than shown, or some components may be combined, or a different arrangement of components.
The memory 11 includes an internal memory and at least one type of readable storage medium. The internal memory provides a cache for the operation of the electronic device 1; the readable storage medium may be a non-volatile storage medium such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a random access memory (RAM), a static random access memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 1, such as a hard disk of the electronic device 1; in other embodiments, the readable storage medium may also be an external storage device of the electronic device 1, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card or a flash card provided on the electronic device 1. In this embodiment, the readable storage medium of the memory 11 is generally used to store the operating system and the various application software installed in the electronic device 1, such as the code of the feature binning program 10 in an embodiment of the present invention. Further, the memory 11 may also be used to temporarily store various types of data that have been output or are to be output.
In some embodiments, the processor 12 may be a central processing unit (CPU), controller, microcontroller, microprocessor or other data processing chip. The processor 12 is generally configured to control the overall operation of the electronic device 1, such as performing control and processing related to data interaction or communication with other devices. In this embodiment, the processor 12 is configured to run the program code stored in the memory 11 or process data, for example to run the feature binning program 10.
The network interface 13 may comprise a wireless network interface or a wired network interface, and the network interface 13 is used for establishing a communication connection between the electronic device 1 and a client (not shown).
Optionally, the electronic device 1 may further include a user interface, the user interface may include a Display (Display), an input unit such as a Keyboard (Keyboard), and the optional user interface may further include a standard wired interface and a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the electronic device 1 and for displaying a visualized user interface, among other things.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The memory 11 of the electronic device 1 stores a feature binning program 10 which is a combination of instructions that, when executed in the processor 12, may implement the steps of the feature binning method described above.
Specifically, the processor 12 may refer to the description of the relevant steps in the embodiment corresponding to fig. 1 for a specific implementation method of the feature binning program 10, which is not described herein again.
Further, if the integrated modules/units of the electronic device 1 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. The computer-readable medium may be non-volatile or volatile. The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, or a read-only memory (ROM).
The computer readable storage medium has stored thereon a feature binning program 10, which feature binning program 10 is executable by one or more processors to implement the steps in the feature binning method described above.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. Terms such as first and second are used to denote names and do not denote any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A feature binning method applied to a first participant in communication with a second participant, wherein the first participant and the second participant hold the same sample objects but different sample features, the method comprising:
encrypting the label of each sample in a locally stored first sample set with a homomorphic encryption key to obtain a label ciphertext for each sample, establishing a mapping relation between the label ciphertexts and the sample IDs, and sending the mapping relation to the second participant;
performing a binning operation on each first feature to be binned in the first sample set using a first binning method and a first binning quantity to obtain a first binning result corresponding to each first feature, and calculating, based on the first binning result, a first number of each label in each first bin after each first feature is binned;
receiving the sum of the label ciphertexts of the samples in each second bin after each second feature is binned, wherein the sum of the label ciphertexts is computed by the second participant based on the mapping relation and a second binning result, and the second binning result is obtained by the second participant performing a binning operation on each second feature to be binned in its locally stored second sample set using a second binning method and a second binning quantity;
and judging, based on the first numbers and the sums of the label ciphertexts, whether the first binning results and the second binning results are reasonable; if so, selecting a first target feature and a second target feature to be used for modeling from the first features and the second features respectively, encoding the first target feature to obtain binned data, and sending the second target feature to the second participant so that the second participant encodes it to complete binning.
2. The feature binning method as claimed in claim 1, wherein the judging whether the first binning results and the second binning results are reasonable based on the first numbers and the sums of the label ciphertexts comprises:
determining, based on the sums of the label ciphertexts, a second number of each label in each second bin after each second feature is binned;
encoding the features in each first bin based on the first numbers to obtain a first encoded value for each first bin, and judging, based on the first encoded values, whether the corresponding first binning result is reasonable;
and encoding the features in each second bin based on the second numbers to obtain a second encoded value for each second bin, and judging, based on the second encoded values, whether the corresponding second binning result is reasonable.
3. The feature binning method as claimed in claim 2, wherein the labels comprise a first label and a second label, and the determining, based on the sums of the label ciphertexts, a second number of each label in each second bin after each second feature is binned comprises:
receiving, from the second participant, the sum of ciphertext differences for each second bin after each second feature is binned, wherein the sum of ciphertext differences is the sum, over the samples in the corresponding second bin, of the difference between a preset value and each sample's label ciphertext;
determining a second number of first labels in each second bin based on the sum of the label ciphertexts;
and determining a second number of second labels in each second bin based on the sum of the ciphertext differences.
4. The feature binning method as claimed in claim 2, wherein the judging whether the corresponding first binning result is reasonable based on the first encoded values comprises:
if the first encoded values of the first bins corresponding to a given first feature are monotonic, the first binning result corresponding to that first feature is reasonable.
5. The feature binning method as claimed in claim 1, wherein after the judging whether the first binning results and the second binning results are reasonable, the method further comprises:
if a first binning result is judged to be unreasonable, adjusting that first binning result and performing the reasonableness judgment on the adjusted result;
and if a second binning result is judged to be unreasonable, sending early-warning information to the second participant to prompt the second participant to adjust that second binning result and perform the reasonableness judgment on the adjusted result.
6. The method of claim 1, wherein the selecting a first target feature and a second target feature to be used for modeling from the first features and the second features respectively comprises:
calculating a first information value for each first feature based on the first encoded values;
calculating a second information value for each second feature based on the second encoded values;
and taking a first feature whose first information value lies in a preset value interval as a first target feature, and taking a second feature whose second information value lies in the preset value interval as a second target feature.
7. The feature binning method as claimed in claim 2, wherein the first encoded value is calculated by the formula:
WOE_{a-i} = ln( (Q_{a-i} / Q_T) / (P_{a-i} / P_T) )
wherein WOE_{a-i} is the first encoded value of the i-th first bin after the a-th first feature is binned, Q_{a-i} is the first number of second labels in the i-th first bin after the a-th first feature is binned, P_{a-i} is the first number of first labels in the i-th first bin after the a-th first feature is binned, Q_T is the total number of second labels in the first sample set, and P_T is the total number of first labels in the first sample set.
8. A feature binning apparatus, characterized in that the apparatus comprises:
an encryption module, configured to encrypt the label of each sample in a locally stored first sample set with a homomorphic encryption key to obtain a label ciphertext for each sample, establish a mapping relation between the label ciphertexts and the sample IDs, and send the mapping relation to a second participant;
a binning module, configured to perform a binning operation on each first feature to be binned in the first sample set using a first binning method and a first binning quantity to obtain a first binning result corresponding to each first feature, and to calculate, based on the first binning result, a first number of each label in each first bin after each first feature is binned;
a receiving module, configured to receive the sum of the label ciphertexts of the samples in each second bin after each second feature is binned, wherein the sum of the label ciphertexts is computed by the second participant based on the mapping relation and a second binning result, and the second binning result is obtained by the second participant performing a binning operation on each second feature to be binned in its locally stored second sample set using a second binning method and a second binning quantity;
and a judging module, configured to judge, based on the first numbers and the sums of the label ciphertexts, whether the first binning results and the second binning results are reasonable, and, if so, to select a first target feature and a second target feature to be used for modeling from the first features and the second features respectively, encode the first target feature to obtain binned data, and send the second target feature to the second participant so that the second participant encodes it to complete binning.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor, wherein
the memory stores a feature binning program executable by the at least one processor to enable the at least one processor to perform the feature binning method of any one of claims 1 to 7.
10. A computer-readable storage medium having stored thereon a feature binning program executable by one or more processors to implement the feature binning method of any one of claims 1 to 7.
CN202111608427.6A 2021-12-24 2021-12-24 Feature binning method and device, electronic equipment and storage medium Pending CN114298211A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111608427.6A CN114298211A (en) 2021-12-24 2021-12-24 Feature binning method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111608427.6A CN114298211A (en) 2021-12-24 2021-12-24 Feature binning method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114298211A true CN114298211A (en) 2022-04-08

Family

ID=80970438

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111608427.6A Pending CN114298211A (en) 2021-12-24 2021-12-24 Feature binning method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114298211A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116244650A (en) * 2023-05-12 2023-06-09 北京富算科技有限公司 Feature binning method, device, electronic equipment and computer readable storage medium
CN116244650B (en) * 2023-05-12 2023-10-03 北京富算科技有限公司 Feature binning method, device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination