CN111401572B - Supervised feature binning method and device based on privacy protection - Google Patents

Supervised feature binning method and device based on privacy protection

Info

Publication number
CN111401572B
Authority
CN
China
Prior art keywords
binning
values
sequence
feature
holder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010502530.1A
Other languages
Chinese (zh)
Other versions
CN111401572A (en
Inventor
李漓春
张文彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN202010502530.1A
Publication of CN111401572A
Application granted
Publication of CN111401572B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/602 Providing cryptographic facilities or services
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/008 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols involving homomorphic encryption
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/08 Key distribution or management, e.g. generation, sharing or updating, of cryptographic keys or passwords
    • H04L9/0816 Key establishment, i.e. cryptographic processes or cryptographic protocols whereby a shared secret becomes available to two or more parties, for subsequent use
    • H04L9/0819 Key transport or distribution, i.e. key establishment techniques where one party creates or otherwise obtains a secret value, and securely transfers it to the other(s)
    • H04L9/0825 Key transport or distribution, i.e. key establishment techniques where one party creates or otherwise obtains a secret value, and securely transfers it to the other(s) using asymmetric-key encryption or public key infrastructure [PKI], e.g. key signature or public key certificates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of this specification provide a supervised feature binning method and device based on privacy protection. The two parties each store their own private data. The tag holder sends N homomorphically encrypted first encrypted tag values to the feature holder. The feature holder associates the N first encrypted tag values with the N feature values, reorders the N feature values by value to obtain a first sequence of the N feature values arranged in the update order and a second sequence of N second encrypted tag values, and sends the second sequence to the tag holder. The tag holder decrypts the second encrypted tag values in the second sequence to obtain the original tag values in each initial bin, performs feature binning based on the original tag values to obtain a first binning result, and sends the first binning result to the feature holder. The feature holder then bins the N feature values according to the first binning result.

Description

Supervised feature binning method and device based on privacy protection
Technical Field
One or more embodiments of the present description relate to the field of data processing technologies, and in particular, to a method and an apparatus for supervised feature binning based on privacy protection.
Background
Binning is a method of processing features in machine learning modeling. Binning a feature means grouping the (possibly large) set of values of that feature and treating each group as one class value, i.e., mapping the many values in the set to a few class values. For example, for the age feature, the age values of all samples form a discrete set from 1 to 50, and grouping this set may yield the following 3 bins: ages 1 to 15, ages 16 to 35, and ages 36 to 50. Binning discretizes continuous variables and reduces multi-state discrete variables to fewer states. Binned features can improve model training in several ways: for example, the model can be iterated more quickly, its stability can be improved, and overfitting can be reduced.
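The age example can be sketched in a few lines of Python. This is only an illustration of fixed-edge binning: the function name `age_bin` and the constant `UPPER_BOUNDS` are hypothetical, and the edges assume the three non-overlapping bins 1 to 15, 16 to 35, and 36 to 50.

```python
from bisect import bisect_left

# Inclusive upper bounds of the first two bins; the last bin takes the rest.
# Assumed edges for the age example: 1-15, 16-35, 36-50.
UPPER_BOUNDS = [15, 35]


def age_bin(age: int) -> int:
    """Return the 0-based bin index for an age in the example range 1..50."""
    if not 1 <= age <= 50:
        raise ValueError("age outside the example range 1..50")
    # bisect_left counts the upper bounds that are strictly below `age`,
    # so 15 stays in bin 0 and 16 moves to bin 1.
    return bisect_left(UPPER_BOUNDS, age)


ages = [3, 15, 16, 35, 36, 50]
binned = [age_bin(a) for a in ages]
```

After binning, a model sees only three class values instead of fifty distinct ages.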
Binning methods include unsupervised binning and supervised binning. Unsupervised binning does not rely on sample labels to bin features, whereas supervised binning bins features in combination with sample labels.
In supervised binning, one application scenario is that the features and the labels of the samples are held by different owners, each of which requires privacy protection for its own data and will not output that data in plaintext. However, both parties need supervised binning of the features for purposes such as jointly training a model. An improved scheme is therefore desired that enables supervised binning of features when features and labels are distributed among different parties, while protecting the privacy and security of the private data.
Disclosure of Invention
One or more embodiments of the present specification describe a supervised feature binning method and apparatus based on privacy protection, so as to implement supervised binning on features in a scenario where the features and tags are distributed in different parties, and simultaneously ensure privacy and security of private data. The specific technical scheme is as follows.
In a first aspect, an embodiment provides a supervised feature binning method based on privacy protection, performed by a feature holder, where the feature holder stores feature values of a first feature of N samples, original tag values of the N samples are stored in a tag holder, and the N samples are arranged in a predetermined order; the method comprises the following steps:
acquiring N first encrypted tag values that are sent by the tag holder and arranged in the predetermined order, wherein each first encrypted tag value is obtained by homomorphically encrypting the corresponding original tag value using a public key;
associating, based on the predetermined order, the N first encrypted tag values with the N feature values of the first feature, respectively, to obtain an association relationship;
reordering the N feature values by value to obtain a first sequence consisting of the N feature values arranged in an update order, and obtaining, based on the association relationship, a second sequence consisting of N second encrypted tag values arranged in the update order;
sending at least the second sequence to the tag holder, so that the tag holder performs feature binning based on at least the second sequence to obtain a first binning result;
receiving the first binning result sent by the tag holder, wherein the first binning result indicates the first bin corresponding to each position in the update order;
and binning the feature values at the positions in the first sequence according to the first binning result to obtain a feature binning result.
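The feature-holder steps of the first aspect can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `reorder_with_tags` and `apply_binning` are hypothetical, the ciphertexts are opaque string stand-ins rather than real homomorphic ciphertexts, the second encrypted tag values are assumed to equal the first ones, and the binning result passed in at the end is made up for the example.

```python
def reorder_with_tags(feature_values, encrypted_tags):
    """Associate by position, then sort by feature value (ascending).

    Returns (first_sequence, second_sequence, order), where `order` lists the
    original positions in the update order.
    """
    if len(feature_values) != len(encrypted_tags):
        raise ValueError("both parties must hold N entries in the same order")
    # sorted() is stable, so equal feature values keep their relative order.
    order = sorted(range(len(feature_values)), key=lambda i: feature_values[i])
    first_sequence = [feature_values[i] for i in order]
    # Assumption: the second encrypted tag value equals the first one.
    second_sequence = [encrypted_tags[i] for i in order]
    return first_sequence, second_sequence, order


def apply_binning(first_sequence, bin_of_position):
    """Group the sorted feature values by the bin reported for each position."""
    bins = {}
    for value, bin_index in zip(first_sequence, bin_of_position):
        bins.setdefault(bin_index, []).append(value)
    return bins


features = [42, 17, 30, 25]
ciphertexts = ["c0", "c1", "c2", "c3"]  # opaque stand-ins for ciphertexts
first_seq, second_seq, order = reorder_with_tags(features, ciphertexts)
# Hypothetical first binning result returned by the tag holder.
feature_bins = apply_binning(first_seq, [0, 0, 1, 1])
```

Only `second_seq` leaves the feature holder; the feature values and the permutation `order` stay local.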
In a second aspect, an embodiment provides a supervised feature binning method based on privacy protection, performed by a tag holder, where the tag holder stores original tag values of N samples, feature values of a first feature of the N samples are stored in a feature holder, and the N samples are arranged in a predetermined order; the method comprises the following steps:
homomorphically encrypting the N original tag values into corresponding first encrypted tag values using a public key, and sending the N first encrypted tag values arranged in the predetermined order to the feature holder;
receiving at least a second sequence sent by the feature holder, wherein the second sequence consists of N second encrypted tag values arranged in an update order;
decrypting the N second encrypted tag values in the second sequence into corresponding original tag values using a private key corresponding to the public key, to obtain N original tag values arranged in the update order;
performing an adjacent-bin merging operation based on at least the N original tag values arranged in the update order to obtain a first binning result, wherein the first binning result indicates the first bin corresponding to each position in the update order;
sending the first binning result to the feature holder.
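As an illustration of adjacent-bin merging on the decrypted tag values, the sketch below greedily merges the two neighbouring bins whose positive rates are closest. The merging criterion is an assumption chosen for brevity; this passage does not fix one, and practical schemes often use a statistical test such as chi-square instead. The function name `merge_adjacent_bins` and the parameter `target_bins` are hypothetical, and the tags are assumed to be binary (0 or 1).

```python
def merge_adjacent_bins(labels, target_bins):
    """Greedy adjacent merging over binary tags given in the update order.

    Returns one bin index per position in the update order.
    """
    if target_bins < 1:
        raise ValueError("target_bins must be at least 1")
    # Each bin is [start, end, positives]; start with one bin per position.
    bins = [[i, i + 1, y] for i, y in enumerate(labels)]
    while len(bins) > target_bins:
        rates = [b[2] / (b[1] - b[0]) for b in bins]
        # Pick the adjacent pair whose positive rates are closest.
        k = min(range(len(bins) - 1), key=lambda j: abs(rates[j] - rates[j + 1]))
        left, right = bins[k], bins[k + 1]
        bins[k:k + 2] = [[left[0], right[1], left[2] + right[2]]]
    # Expand to one bin index per position.
    result = []
    for index, (start, end, _) in enumerate(bins):
        result.extend([index] * (end - start))
    return result


first_binning_result = merge_adjacent_bins([0, 0, 0, 1, 1, 1], target_bins=2)
```

The returned list refers only to positions in the update order, so it can be sent to the feature holder without exposing any tag value.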
In a third aspect, an embodiment provides a supervised feature binning method based on privacy protection, performed by a tag holder, where the tag holder stores original tag values of N samples, feature values of a first feature of the N samples are stored in a feature holder, and the N samples are arranged in a predetermined order; the method comprises the following steps:
homomorphically encrypting the N original tag values into corresponding first encrypted tag values using a public key, and sending the N first encrypted tag values arranged in the predetermined order to the feature holder;
receiving at least a second sequence sent by the feature holder, wherein the second sequence consists of N second encrypted tag values arranged in an update order;
decrypting the N second encrypted tag values in the second sequence into corresponding original tag values using a private key corresponding to the public key, to obtain N original tag values arranged in the update order;
performing a split binning operation based on at least the N original tag values arranged in the update order to obtain a first binning result, wherein the first binning result indicates the first bin corresponding to each position in the update order;
sending the first binning result to the feature holder.
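A single round of split-based binning can be sketched as a search for the cut position that best separates the tags. The use of weighted Gini impurity is an assumption made for this example; the passage above does not name a splitting criterion. The function names `gini`, `best_split`, and `split_once` are hypothetical, and the tags are assumed to be binary.

```python
def gini(positives, total):
    """Gini impurity of a bin with `positives` ones among `total` binary tags."""
    if total == 0:
        return 0.0
    p = positives / total
    return 2.0 * p * (1.0 - p)


def best_split(labels):
    """Return the cut c (1 <= c < N) minimizing the weighted Gini impurity
    of labels[:c] and labels[c:], or None if there are fewer than 2 tags."""
    n, total_pos = len(labels), sum(labels)
    best_cut, best_score = None, float("inf")
    left_pos = 0
    for cut in range(1, n):
        left_pos += labels[cut - 1]
        score = (cut * gini(left_pos, cut)
                 + (n - cut) * gini(total_pos - left_pos, n - cut)) / n
        if score < best_score:
            best_cut, best_score = cut, score
    return best_cut


def split_once(labels):
    """Split the update order into two bins; returns a bin index per position."""
    cut = best_split(labels)
    if cut is None:
        return [0] * len(labels)
    return [0] * cut + [1] * (len(labels) - cut)


first_binning_result = split_once([0, 0, 1, 0, 1, 1, 1])
```

An iterative variant would apply `best_split` again inside each resulting bin until a stopping condition (for example, a maximum number of bins) is reached.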
In a fourth aspect, an embodiment provides a supervised feature binning apparatus based on privacy protection, deployed in a feature holder, where the feature holder stores feature values of a first feature of N samples, original tag values of the N samples are stored in a tag holder, and the N samples are arranged in a predetermined order; the apparatus comprises:
an obtaining module configured to obtain N first encrypted tag values that are sent by the tag holder and arranged in the predetermined order, where each first encrypted tag value is obtained by homomorphically encrypting the corresponding original tag value using a public key;
an association module configured to associate, based on the predetermined order, the N first encrypted tag values with the N feature values of the first feature, respectively, to obtain an association relationship;
a reordering module configured to reorder the N feature values by value to obtain a first sequence consisting of the N feature values arranged in an update order, and to obtain, based on the association relationship, a second sequence consisting of N second encrypted tag values arranged in the update order;
a first sending module configured to send at least the second sequence to the tag holder, so that the tag holder performs feature binning based on at least the second sequence to obtain a first binning result;
a first receiving module configured to receive the first binning result sent by the tag holder, where the first binning result indicates the first bin corresponding to each position in the update order;
and a first binning module configured to bin the feature values at the positions in the first sequence according to the first binning result to obtain a feature binning result.
In a fifth aspect, an embodiment provides a supervised feature binning apparatus based on privacy protection, deployed in a tag holder, where the tag holder stores original tag values of N samples, feature values of a first feature of the N samples are stored in a feature holder, and the N samples are arranged in a predetermined order; the apparatus comprises:
an encryption module configured to homomorphically encrypt the N original tag values into corresponding first encrypted tag values using a public key, and to send the N first encrypted tag values arranged in the predetermined order to the feature holder;
a second receiving module configured to receive at least a second sequence sent by the feature holder, where the second sequence consists of N second encrypted tag values arranged in an update order;
a decryption module configured to decrypt the N second encrypted tag values in the second sequence into corresponding original tag values using a private key corresponding to the public key, to obtain N original tag values arranged in the update order;
a second binning module configured to perform an adjacent-bin merging operation based on at least the N original tag values arranged in the update order to obtain a first binning result, where the first binning result indicates the first bin corresponding to each position in the update order;
a second sending module configured to send the first binning result to the feature holder.
In a sixth aspect, an embodiment provides a supervised feature binning apparatus based on privacy protection, deployed in a tag holder, where the tag holder stores original tag values of N samples, feature values of a first feature of the N samples are stored in a feature holder, and the N samples are arranged in a predetermined order; the apparatus comprises:
an encryption module configured to homomorphically encrypt the N original tag values into corresponding first encrypted tag values using a public key, and to send the N first encrypted tag values arranged in the predetermined order to the feature holder;
a second receiving module configured to receive at least a second sequence sent by the feature holder, where the second sequence consists of N second encrypted tag values arranged in an update order;
a decryption module configured to decrypt the N second encrypted tag values in the second sequence into corresponding original tag values using a private key corresponding to the public key, to obtain N original tag values arranged in the update order;
a third binning module configured to perform a split binning operation based on at least the N original tag values arranged in the update order to obtain a first binning result, where the first binning result indicates the first bin corresponding to each position in the update order;
a second sending module configured to send the first binning result to the feature holder.
In a seventh aspect, embodiments provide a computer-readable storage medium, on which a computer program is stored, which, when executed in a computer, causes the computer to perform the method of any one of the first to third aspects.
In an eighth aspect, an embodiment provides a computing device, including a memory and a processor, where the memory stores executable code, and the processor executes the executable code to implement the method of any one of the first to third aspects.
According to the method and device provided by the embodiments of this specification, the tag holder homomorphically encrypts the original tag values and sends the encrypted tag values to the feature holder; the feature holder associates the feature values with the encrypted tag values, reorders the feature values, and sends the reordered encrypted tag values to the tag holder. In this way, the tag holder can decrypt them to obtain the original tag values in the update order, perform further feature binning based on these values, and send the resulting first binning result to the feature holder. No plaintext data is sent during the whole interaction, supervised binning of the feature is achieved, and homomorphic encryption protects the privacy and security of the private data as far as possible.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
FIG. 1 is a schematic diagram illustrating an implementation scenario of an embodiment disclosed herein;
FIG. 2 is a schematic diagram illustrating an exemplary interaction flow for binning between two parties;
FIG. 3 is a schematic diagram of the locations of equal feature values in a feature holder and the first bin numbers;
FIG. 4 is a flowchart illustrating an iterative implementation of step S260 in FIG. 2;
FIG. 5 is a schematic diagram illustrating an interaction flow for binning between two parties in accordance with another embodiment;
FIG. 6 is a schematic diagram of a split-based iterative binning process;
FIG. 7 is a schematic block diagram of a binning apparatus deployed at a feature holder provided by one embodiment;
FIG. 8 is a schematic block diagram of a binning apparatus deployed at a tag holder provided in one embodiment;
FIG. 9 is a schematic block diagram of a binning apparatus deployed at a tag holder according to another embodiment.
Detailed Description
The scheme provided by the specification is described below with reference to the accompanying drawings.
FIG. 1 is a schematic diagram of an implementation scenario of an embodiment disclosed in this specification. The feature holder 10 stores the feature values of a first feature of N samples, and the tag holder 20 stores the original tag values of the N samples, where the N samples are arranged in a predetermined order and N is a natural number. The N samples may be samples in a model test set or samples in a model training set, which is not limited in this specification.
The first feature may be any one of a plurality of features of the sample. For example, each sample may be one of the following business objects: users, goods, merchants, events, and the like. When the sample is a commodity, its features may include price, sales, and so on, and its labels may include categories such as high-volume goods, medium-volume goods, and low-volume goods. When the sample is a user, its features may include age, income, consumption amount, and so on, and its label may include a usage frequency value for a certain client. The feature value of the first feature may be discrete or continuous. The labels of the samples may be classification labels or non-classification labels, i.e., numerical labels.
In a risk control scenario, the features of the sample may include user data. Users can be classified into risky users (abnormal users) and non-risky users (normal users), and the user data is private data that needs to be kept secret. In such a scenario, the label of the sample may be a classification label, and the dataset containing the samples may be used to train a risk control model.
The feature holder 10 stores the sample identifiers of the N samples and the corresponding feature values (i.e., feature data) of the first feature; the feature values are represented by the placeholder xx in FIG. 1. The tag holder 20 stores the sample identifiers of the N samples and the corresponding tag values (i.e., tag data); the tag values are represented by the placeholder yy in FIG. 1. The feature holder 10 and the tag holder 20 may share a sample ordering. That the N samples are arranged in a predetermined order means that the ordering of the samples at the feature holder 10 and the ordering of the tags at the tag holder 20 satisfy the following: the nth sample of the feature holder 10 and the nth tag of the tag holder 20 together form a complete labeled sample, where n is a natural number less than or equal to N. For example, the 1st sample of the feature holder 10 describes the features of user A (such as age, education level, and monthly consumption amount), the 1st tag of the tag holder 20 describes whether user A is a high-risk user (for example, "1" for a high-risk user and "0" for a non-high-risk user), and together they constitute a labeled sample describing user A. As another example, the 16th sample of the feature holder 10 describes the features of user P (such as age, education level, and monthly consumption amount), the 16th tag of the tag holder 20 describes whether user P is a high-risk user, and together they constitute a labeled sample describing user P. Both the feature data and the tag data are private data.
When the features and the labels of the samples are distributed among different holders, and each holder requires privacy protection for its own data and cannot output that data in plaintext, the embodiments of this specification provide a two-party joint binning method that achieves supervised binning of the features without revealing either holder's private data. Referring to the interaction process shown in FIG. 1, the tag holder 20 homomorphically encrypts the original tag values to obtain first encrypted tag values and sends them to the feature holder 10; the feature holder 10 reorders the feature values, obtains second encrypted tag values in the update order according to the association between the feature values and the received first encrypted tag values, and sends the second encrypted tag values in the update order to the tag holder 20; the tag holder 20 performs further feature binning based on the received information to obtain a first binning result and sends it to the feature holder 10; the feature holder 10 bins the feature values in the update order based on the first binning result to obtain a feature binning result. Thus no plaintext data is transmitted during the whole interaction, supervised binning of the feature is achieved, and homomorphic encryption protects the privacy and security of the private data as far as possible.
In FIG. 1, each bin (bin 1, bin 2, and bin 3) obtained at the feature holder 10 may contain a different number of samples, and in general the samples in different bins do not overlap. The feature binning result shown is only an example; in a particular application, a feature binning result may contain a different number of bins and different samples in each bin.
When the feature holder holds a plurality of features of the samples, the binning method of the embodiments of this specification may be applied to each feature, with supervised binning of that feature achieved through interaction with the tag holder. When the features of the samples are distributed among different feature holders, the method may be performed, in the manner of the embodiments of this specification, by the tag holder and the feature holder that holds the feature to be binned.
The following describes an embodiment of this specification in more detail with reference to the scenario diagram shown in FIG. 1.
FIG. 2 is a flowchart of a supervised feature binning method based on privacy protection according to an embodiment, performed through interaction between the feature holder 10 and the tag holder 20. The feature holder 10 stores the feature values of a first feature of the N samples, and the original tag values of the N samples are stored in the tag holder 20. At both parties, the N samples are arranged in a predetermined order, which may be ascending or descending order of the sample identifiers (IDs), or the order in which the sample identifiers appear in a designated dictionary. The method includes the following steps S210 to S280.
In step S210, the tag holder 20 uses the public key Key1 to homomorphically encrypt the N original tag values into corresponding first encrypted tag values, for example, homomorphically encrypting the original tag value yy into E(yy), and sends the N first encrypted tag values arranged in the predetermined order to the feature holder 10. The feature holder 10 receives the N first encrypted tag values, arranged in the predetermined order, sent by the tag holder 20.
The feature holder 10 may first send a data acquisition request to the tag holder 20. On receiving the request, the tag holder 20 uses the public key Key1 to homomorphically encrypt the N original tag values into corresponding first encrypted tag values, and sends the N first encrypted tag values arranged in the predetermined order to the feature holder 10.
In this step, the tag holder 20 may itself generate the public key Key1 and the corresponding private key Key2 for homomorphic encryption, or may obtain a public key Key1 and private key Key2 that were generated in advance.
The N original tag values may fall within a certain value range, and each specific value may appear many times. For example, an original tag value may take a value in the set {0, 1, 2, 3, 4}, and each of these values may appear multiple times among the N original tag values. However, the tag holder 20 encrypts the N original tag values one by one using the public key, and all the resulting first encrypted tag values can be made different from each other, so that the feature holder cannot infer the private data of the tag holder from the N first encrypted tag values arranged in the predetermined order. This encryption approach therefore keeps the tag values of the tag holder confidential and avoids disclosing private data as far as possible. The implementation of homomorphic encryption is described in detail at the end of this embodiment.
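The property that equal tag values encrypt to different ciphertexts can be illustrated with a toy additively homomorphic scheme. The sketch below uses a textbook Paillier construction with generator n + 1; choosing Paillier is an assumption, since this passage only says "homomorphic encryption". The primes are the Mersenne primes 2^31 - 1 and 2^61 - 1, which are far too small for real security, and the function names `keygen`, `encrypt`, and `decrypt` are hypothetical.

```python
import math
import secrets


def keygen(p=2**31 - 1, q=2**61 - 1):
    """Toy Paillier key pair from two known primes (insecure, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse; requires Python 3.8+
    return n, (lam, mu)   # public key n, private key (lam, mu)


def encrypt(n, m):
    """Encrypt 0 <= m < n; fresh randomness r makes each ciphertext different."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # uniform in 1..n-1
        if math.gcd(r, n) == 1:
            break
    # With generator g = n + 1, g^m mod n^2 equals 1 + m*n.
    return ((1 + m * n) * pow(r, n, n2)) % n2


def decrypt(n, private_key, c):
    """Recover m from ciphertext c using the private key (lam, mu)."""
    lam, mu = private_key
    n2 = n * n
    ell = (pow(c, lam, n2) - 1) // n
    return (ell * mu) % n


n, private_key = keygen()
tags = [0, 1, 1, 0, 1]                       # repeated original tag values
ciphertexts = [encrypt(n, y) for y in tags]  # sent to the feature holder
recovered = [decrypt(n, private_key, c) for c in ciphertexts]
```

Although the tags contain only two distinct values, the five ciphertexts differ from one another (up to a negligible collision probability), so the feature holder cannot group samples by tag. Multiplying two ciphertexts modulo n^2 yields an encryption of the sum of their plaintexts, which is the additive homomorphic property of this particular scheme.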
In step S220, the feature holder 10 associates, based on the predetermined order, the N first encrypted tag values with the N feature values of the first feature, respectively, to obtain an association relationship. The feature holder 10 stores in advance the N feature values arranged in the predetermined order, and can directly associate each of them with the corresponding first encrypted tag value. For example, Table 1 shows the association between the N feature values of the first feature and the corresponding first encrypted tag values in the feature holder 10.
TABLE 1
Sample ID                   1      2      3      4      5      6      N
Feature value               xx     xx     xx     xx     xx     xx     xx
First encrypted tag value   E(yy)  E(yy)  E(yy)  E(yy)  E(yy)  E(yy)  E(yy)
In Table 1, xx and yy are placeholders: the feature values may differ from one another, as may the original tag values, and the first encrypted tag values are all different from each other.
Associating the feature values with the first encrypted tag values yields the correspondence among sample ID, feature value, and first encrypted tag value.
In step S230, the feature holder 10 reorders the N feature values by value to obtain a first sequence of the N feature values arranged in the update order, and obtains, based on the association relationship, a second sequence of N second encrypted tag values arranged in the update order. In step S240, the feature holder 10 sends at least the second sequence to the tag holder 20, and the tag holder 20 receives the second sequence sent by the feature holder 10.
In one embodiment, the N second encrypted tag values arranged in the update order may simply be the N first encrypted tag values arranged in the update order, that is, each second encrypted tag value equals the corresponding first encrypted tag value. In another embodiment, each first encrypted tag value may be further processed to obtain the corresponding second encrypted tag value, yielding the N second encrypted tag values arranged in the update order. In such an embodiment, the second encrypted tag value is not equal to the corresponding first encrypted tag value. The specific implementation of this alternative is described in detail in later embodiments. The step of processing the first encrypted tag values to obtain the corresponding second encrypted tag values may be performed before, after, or simultaneously with the reordering of the N feature values.
The first feature may be of a continuous type or a discrete type. Discrete features here are features whose values have an ordering relationship, such as age and height. Continuous features include revenue, number of transactions, and the like.
Sorting by value may be from large to small or from small to large; the specific sorting mode is chosen at implementation time.
Because the sample ID, the feature value, and the first encrypted tag value are associated with one another, reordering the N feature values by value also determines the order of the encrypted tag values. This yields the first sequence of N feature values arranged in the update order and the second sequence of N second encrypted tag values arranged in the same update order.
For example, referring to table 1, the sample ID, the feature value, and the first encrypted tag value in each column are associated with one another. After the N feature values in table 1 are reordered, the columns of table 1 change positions, giving the N feature values arranged in the update order shown in table 2 together with the corresponding first encrypted tag values, second encrypted tag values, and original tag values. Each column corresponds to one position in the update order. The feature holder 10 holds the positions in the update order, the sample IDs, the feature values, the first encrypted tag values, and the second encrypted tag values, but does not hold the original tag values. The original tag values listed in the last row of table 2 only illustrate the correspondence between the original tag values and the first and second encrypted tag values.
TABLE 2
Position in the update order | Position 1 | Position 2 | Position 3 | Position 4 | … | Position N
Sample ID | 5 | 22 | 3 | 55 | … | 14
Feature value | xx | xx | xx | xx | … | xx
First encrypted tag value | E(yy) | E(yy) | E(yy) | E(yy) | … | E(yy)
Second encrypted tag value | E′(yy) | E′(yy) | E′(yy) | E′(yy) | E′(yy) | E′(yy)
Original tag value | yy | yy | yy | yy | yy | yy
The feature holder 10 sends the second sequence to the tag holder 20; the order of its elements in effect carries the positions of the update order.
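The association and reordering in steps S220 to S240 can be sketched as follows. This is only a minimal illustration: the sample IDs, feature values, and the opaque strings standing in for ciphertexts are hypothetical, and the sketch assumes the embodiment in which each second encrypted tag value equals the first encrypted tag value.

```python
# Hypothetical toy data held by the feature holder.
sample_ids = [1, 2, 3, 4, 5]
feature_values = [30, 12, 25, 40, 8]
# Ciphertexts received from the tag holder in the predetermined order
# (opaque placeholders here, standing in for E(yy)).
first_encrypted = ["c1", "c2", "c3", "c4", "c5"]

# Association relationship: the i-th ciphertext belongs to the i-th sample.
records = list(zip(sample_ids, feature_values, first_encrypted))

# Reorder by feature value (ascending here; descending works the same way).
records.sort(key=lambda rec: rec[1])

first_sequence = [value for _, value, _ in records]       # kept by the feature holder
second_sequence = [cipher for _, _, cipher in records]    # sent to the tag holder
ids_in_update_order = [sid for sid, _, _ in records]
```

Only `second_sequence` leaves the feature holder in this sketch; the feature values and sample IDs stay local.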
The feature holder 10 may further determine whether the N feature values contain equal feature values. If not, it sends the second sequence directly to the tag holder 20. If equal feature values exist, it may determine the positions of the equal feature values in the update order based on the N feature values in the first sequence, and send both the second sequence and these positions to the tag holder 20. The reason is that, when binning feature values, equal feature values should fall into the same bin and must not be split into different bins.
Of the N eigenvalues in the first sequence, there may be multiple sets of equal eigenvalues, e.g., the fifth and sixth eigenvalues are equal, the ninth and tenth eigenvalues are equal, and so on.
Refer to the schematic diagram of feature binning shown in fig. 3, where the first feature is revenue. The first two rows of fig. 3 show the feature values after reordering from small to large. These reordered feature values contain 6 groups of equal feature values and, overall, can be divided into 7 groups; the number of each group and the positions of the equal feature values it contains are shown in fig. 3.
Since the respective positions in the first and second sequences correspond, the positions of equal eigenvalues in the update order described above also correspond to the positions in the second sequence.
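One way the feature holder might locate equal feature values in the first sequence is sketched below. The function name `equal_value_groups` and the toy values are hypothetical; the sketch assumes the first sequence is already sorted, so equal values are adjacent.

```python
from itertools import groupby

def equal_value_groups(first_sequence):
    """Group 1-based positions of a sorted sequence by runs of equal values."""
    groups, position = [], 1
    for _, run in groupby(first_sequence):
        size = len(list(run))
        groups.append(list(range(position, position + size)))
        position += size
    return groups

first_sequence = [10, 10, 20, 35, 35, 35, 50]   # hypothetical sorted feature values
groups = equal_value_groups(first_sequence)
has_equal_values = any(len(group) > 1 for group in groups)
```

If `has_equal_values` is false, only the second sequence needs to be sent; otherwise the groups describe the positions to send along with it.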
The positions of the equal feature values in the update order can be represented in either of the following ways:
preset spacers are placed between positions in the update order to mark the positions of equal feature values;
or, the positions in the update order are represented by a one-dimensional bitmap, and the positions with equal feature values are distinguished by a specified rule for assigning values in the bitmap. The one-dimensional bitmap may contain N bits, one for each position.
The update order of the first sequence or the second sequence contains N positions in total. Each position may be represented by an arbitrary character, for example 0, 1, or a, and a preset spacer is added between positions with different feature values. When there is no preset spacer between adjacent positions, the feature values at those adjacent positions are equal.
In the example shown in fig. 3, the positions of the equal feature values in the update order can be expressed as 00-0-0000-00-000-00-00, where each 0 represents a position and - is the preset spacer. They can also be expressed as 01, 02-03-04, 05, 06, 07-08, 09-10, 11, 12-13, 14-15, 16, where each position is represented by a consecutive two-digit sequence number, positions with equal feature values are separated by commas, and positions with different feature values are separated by the spacer.
When a bitmap is used, the positions in the update order of the example shown in fig. 3 may be represented by the one-dimensional bitmap 0010000110001100, where each digit represents a position, adjacent identical digits represent positions with equal feature values, and adjacent different digits represent positions with different feature values. The positions may also be represented by the one-dimensional bitmap 0011000101001010, where, reading from left to right, a 1 marks the first position of the next different feature value, that is, a jump between different feature values.
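The representations above can be generated from the group sizes of the fig. 3 example (2, 1, 4, 2, 3, 2, 2). The sketch below is illustrative only; the function names are hypothetical and not part of the described method.

```python
def spacer_form(group_sizes, spacer="-"):
    """One '0' per position, with a spacer between groups of equal values."""
    return spacer.join("0" * size for size in group_sizes)

def numbered_form(group_sizes, spacer="-"):
    """Two-digit sequence numbers; commas inside a group, spacer between groups."""
    parts, position = [], 1
    for size in group_sizes:
        parts.append(", ".join(f"{p:02d}" for p in range(position, position + size)))
        position += size
    return spacer.join(parts)

def alternating_bitmap(group_sizes):
    """Adjacent identical bits mean equal values; the bit flips at each new group."""
    return "".join(str(i % 2) * size for i, size in enumerate(group_sizes))

def jump_bitmap(group_sizes):
    """A '1' marks the first position of every group except the first one."""
    return "".join(("1" if i > 0 else "0") + "0" * (size - 1)
                   for i, size in enumerate(group_sizes))

fig3_groups = [2, 1, 4, 2, 3, 2, 2]
print(spacer_form(fig3_groups))         # 00-0-0000-00-000-00-00
print(alternating_bitmap(fig3_groups))  # 0010000110001100
print(jump_bitmap(fig3_groups))         # 0011000101001010
```

All four forms carry only group boundaries, not the feature values themselves.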
Whichever of the above ways is used to represent the positions in the update order, it reveals neither the feature data stored by the feature holder nor any distribution pattern of that data.
In step S250, the tag holder 20 decrypts the N second encrypted tag values in the second sequence into corresponding original tag values using the private Key2 corresponding to the public Key1, so as to obtain N original tag values arranged in the update order. In decryption, the second encrypted tag value may be decrypted using the private Key2 using a decryption algorithm corresponding to the algorithm used for the above homomorphic encryption. When the second encrypted tag value is equal to the corresponding first encrypted tag value, the private Key2 may be used to directly decrypt the second encrypted tag value to obtain the corresponding original tag value.
In step S260, the tag holder 20 performs an adjacent-bin merging operation on the N original tag values arranged in the update order to obtain a first binning result, which shows the first bin corresponding to each position in the update order. In step S270, the tag holder 20 transmits the first binning result to the feature holder 10, and the feature holder 10 receives the first binning result transmitted by the tag holder 20.
When the tag holder 20 receives only the second sequence from the feature holder 10, the N feature values held by the feature holder 10 are considered to be all different from one another. In this case, the tag holder 20 may directly take the N original tag values arranged in the update order as N initial bins and perform the adjacent-bin merging operation on the N initial bins to obtain the first binning result.
When the tag holder 20 receives, in addition to the second sequence, the positions of equal feature values in the update order, the N feature values held by the feature holder 10 are considered to include equal feature values. In this case, the tag holder 20 may determine the initial bins of the N original tag values arranged in the update order according to the positions of the equal feature values in the update order, and perform the adjacent-bin merging operation on the initial bins to obtain the first binning result.
The first binning result shows first binning corresponding to each position in the update sequence, and it can be understood that the first binning result shows first binning corresponding to a position where each feature value is arranged according to the update sequence, and also shows first binning corresponding to a position where each original tag value is arranged according to the update sequence. The positions of the first sequence, the second sequence and the N original label values arranged according to the updating sequence are mutually corresponding.
When the tag holder 20 performs the adjacent binning and merging operation based on the N original tag values arranged in the update sequence, the adjacent binning and merging operation may be performed in a chi-square binning manner, or other binning methods based on merging may be used.
Chi-square binning is a bottom-up (that is, merging-based) data discretization method. It relies on the chi-square test in statistics and repeatedly merges the adjacent intervals with the smallest chi-square value until a stopping condition is met.
The first binning result may be represented in either of the following ways:
preset spacers are placed between positions in the update order to separate adjacent, different first bins;
or, the positions in the update order are represented by a one-dimensional bitmap, and the different first bins of the positions are distinguished by a specified rule for assigning values in the bitmap. The one-dimensional bitmap may contain N bits, one for each position.
For a detailed description of these two ways, refer to the description of step S230 in conjunction with fig. 3, which is not repeated here. See the first binning result given in the last row of fig. 3. Sending the first binning result in either of these ways reduces the amount of data sent and improves transmission efficiency. Besides these representations, the tag holder 20 may also send the N second encrypted tag values arranged in the update order together with their corresponding first bins to the feature holder 10.
No matter which way the first binning result is expressed, it does not reveal any tag data stored in the tag holder, and any distribution rules between tag data.
In step S280, the feature holder 10 performs binning on the feature values at each position in the first sequence according to the first binning result to obtain a feature binning result. The feature holder 10 may associate each position in the first binning result with each position in the first sequence, and determine the first bin of each position in the first binning result as the bin of the feature value of the corresponding position in the first sequence. In this way, it is possible to determine to which bin the first feature of each sample in the feature holder 10 belongs.
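Step S280 can be sketched as follows, assuming the first binning result arrives as a bitmap in which a 1 marks the first position of a new bin. The feature values and the bitmap are hypothetical, the sample IDs are borrowed from table 2, and `bins_from_jump_bitmap` is an illustrative name.

```python
def bins_from_jump_bitmap(bitmap):
    """Decode a bitmap where '1' marks the first position of a new bin."""
    bin_index, result = 1, []
    for offset, bit in enumerate(bitmap):
        if bit == "1" and offset > 0:
            bin_index += 1
        result.append(bin_index)
    return result

# First sequence kept by the feature holder: (sample ID, feature value) in update order.
first_sequence = [(5, 1200), (22, 1500), (3, 1500), (55, 4000), (14, 9000)]
first_binning_result = "00010"   # hypothetical: positions 1-3 in bin 1, positions 4-5 in bin 2

bin_per_position = bins_from_jump_bitmap(first_binning_result)
feature_bins = {sid: b for (sid, _), b in zip(first_sequence, bin_per_position)}

# Value range covered by each bin, derived locally from the feature values.
bin_ranges = {}
for (_, value), b in zip(first_sequence, bin_per_position):
    low, high = bin_ranges.get(b, (value, value))
    bin_ranges[b] = (min(low, value), max(high, value))
```

Because positions in the first binning result and the first sequence correspond one to one, no feature value needs to leave the feature holder to complete the binning.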
As can be seen from the above, in this embodiment the tag holder homomorphically encrypts the original tag values and sends the encrypted tag values to the feature holder; the feature holder associates the feature values with the encrypted tag values, reorders the feature values, and sends the reordered encrypted tag values to the tag holder. The tag holder can then decrypt them to obtain the original tag values in the update order, perform the binning operation on them, and send the resulting first binning result to the feature holder. No plaintext data is sent at any point in the interaction, supervised binning of the features is achieved, and homomorphic encryption protects the privacy and security of the private data as far as possible.
The homomorphic encryption mentioned in the above embodiment is an encryption algorithm for which operating on plaintexts and then encrypting gives a result equivalent to encrypting first and then performing a corresponding operation on the ciphertexts. For example, suppose two plaintexts $m_1$ and $m_2$ are encrypted with the same public key Key1 to obtain $E(m_1)$ and $E(m_2)$. If

$$E(m_1 + m_2) = E(m_1) \oplus E(m_2)$$

then the encryption algorithm is said to satisfy additive homomorphism, where $\oplus$ is the corresponding homomorphic addition operation. In practice, the $\oplus$ operation may correspond to conventional addition, multiplication, and so on. For example, in the Paillier algorithm, $\oplus$ corresponds to conventional multiplication.
When the tag holder 20 homomorphically encrypts each original tag value, it may also randomly generate a random number r and perform the homomorphic encryption on the original tag value using the public key Key1 and the random number r to obtain the first encrypted tag value. For example, in the Paillier algorithm, the formula

$$C = (1 + \mathrm{Key1})^{m} \cdot r^{\mathrm{Key1}} \bmod \mathrm{Key1}^{2}$$

can be used to encrypt the original tag value m into the first encrypted tag value C. Here m is the plaintext and C is the ciphertext; Key1 is the public key, generally the product of two very large prime numbers; r is the random number used for encryption; and mod is the remainder function. Because a different random number r is generated each time, a different encrypted tag value is obtained each time an original tag value is homomorphically encrypted. This ensures that all encrypted tag values are different from one another.
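The two properties used above, additive homomorphism and randomized ciphertexts, can be checked with a toy Paillier implementation. This is a sketch under stated assumptions, not the implementation of the described method: it uses the common generator choice g = Key1 + 1, tiny hard-coded primes chosen only for readability, and hypothetical helper names. It is not secure.

```python
import math

# Toy key: tiny primes for readability only (real keys use very large primes).
P, Q = 293, 433
KEY1 = P * Q                                        # public key (modulus)
KEY1_SQ = KEY1 * KEY1
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)   # private key part: lcm(P-1, Q-1)
MU = pow(LAM, -1, KEY1)                             # private key part (needs Python 3.8+)

def encrypt(m, r):
    """C = (1 + Key1)^m * r^Key1 mod Key1^2, with r coprime to Key1."""
    if math.gcd(r, KEY1) != 1:
        raise ValueError("r must be coprime to the public key")
    return pow(1 + KEY1, m, KEY1_SQ) * pow(r, KEY1, KEY1_SQ) % KEY1_SQ

def decrypt(c):
    """m = L(c^lambda mod Key1^2) * mu mod Key1, where L(u) = (u - 1) // Key1."""
    u = pow(c, LAM, KEY1_SQ)
    return (u - 1) // KEY1 * MU % KEY1

def homomorphic_add(c1, c2):
    """Multiplying ciphertexts corresponds to adding the plaintexts."""
    return c1 * c2 % KEY1_SQ

c_a = encrypt(1, r=17)          # same tag value, different random numbers
c_b = encrypt(1, r=23)
c_sum = homomorphic_add(encrypt(3, r=5), encrypt(4, r=7))
```

Here `c_a` and `c_b` decrypt to the same tag value but are different ciphertexts, and `c_sum` decrypts to 3 + 4. In practice r would be drawn at random for every encryption rather than passed in explicitly.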
In another embodiment of the present specification, the N original tag values may be integers, for example integers between 0 and k-1, where k is the total number of kinds of original tag values and k is a natural number. When the feature holder 10 obtains, in step S230, the second sequence of N second encrypted tag values arranged in the update order based on the association relationship, the following steps may be performed:
for any one of the N first encrypted tag values, generating a corresponding random number p; multiplying the random number p by a specified integer value M to obtain a transformed random number pM; homomorphically encrypting the transformed random number pM into an encrypted random number E(pM) using the public key Key1; homomorphically adding the encrypted random number E(pM) and the first encrypted tag value E(yy) to obtain the corresponding second encrypted tag value E(pM + yy); and determining, based on the association relationship, the second sequence consisting of the N second encrypted tag values arranged in the update order.
Wherein the specified integer value M is greater than the maximum of the N original tag values. For example, when the N original tag values take integers between [0, k-1], the specified integer value M may take an integer of k or greater. In generating the corresponding random number p, p may be randomly generated within a predetermined range of integers such that the random number p takes an integer.
In one embodiment, different first encrypted tag values may correspond to different random numbers or to the same random number. When different first encrypted tag values correspond to different random numbers, they correspond to different encrypted random numbers. When different first encrypted tag values correspond to the same random number, the transformed random number is homomorphically encrypted anew each time a first encrypted tag value is processed, so different first encrypted tag values still correspond to different encrypted random numbers, and the first encrypted tag values are still perturbed.
After the feature holder 10 processes each first encrypted tag value in this way, it no longer sends the first encrypted tag values themselves to the tag holder 20. This prevents the tag holder 20 from comparing the update order of the first encrypted tag values with the predetermined order in which it generated them and thereby inferring partial feature value information. Even if the tag holder 20 embeds a special mark in the first encrypted tag values, the above processing obscures that mark, so the tag holder 20 cannot infer any private information of the feature holder from the received information.
After such processing is performed on the first encrypted tag value, the second sequence transmitted by the feature holder 10 in step S240 is composed of N second encrypted tag values arranged in the update order. The tag holder 20 may receive a second sequence of N second encrypted tag values arranged in the update order.
In step S250, the tag holder 20 may decrypt the N second encrypted tag values in the second sequence into corresponding first values using the private key Key2 corresponding to the public key Key1, and take the remainder of each first value divided by the specified integer value M to obtain the corresponding original tag value, thereby obtaining the N original tag values arranged in the update order.
For each second encrypted tag value E(pM + yy), the tag holder 20 may decrypt it into the first value pM + yy using the private key Key2 and compute the remainder (pM + yy) % M to obtain the original tag value yy, where % denotes the remainder after division by M. Since pM % M = 0 and M is greater than the maximum value of yy, yy % M = yy, so (pM + yy) % M = yy. Thus, after the above processing, the tag holder 20 can still derive the original tag value from the second encrypted tag value. In other words, provided the tag holder 20 has not added any special mark to the first encrypted tag value, it recovers the correct original tag value.
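The remainder step can be checked in the plaintext domain. The sketch below deliberately omits the encryption and only models the value pM + yy that the tag holder sees after decryption; the range of p, the seed, and the tag values are hypothetical.

```python
import random

def masked_first_value(yy, M, rng):
    """Plaintext-domain view of E(pM) combined with E(yy): the decrypted value is p*M + yy."""
    p = rng.randrange(1, 1000)      # hypothetical range for the random integer p
    return p * M + yy

def recover_tag(first_value, M):
    """Tag holder side: the remainder modulo M recovers the original tag value."""
    return first_value % M

k = 3                               # original tag values lie in [0, k-1]
M = k                               # any integer greater than the largest tag value works
rng = random.Random(0)
original_tags = [0, 2, 1, 1, 0]

first_values = [masked_first_value(yy, M, rng) for yy in original_tags]
recovered = [recover_tag(v, M) for v in first_values]
```

`recovered` equals `original_tags` for any choice of p, because p*M contributes nothing to the remainder and every tag value is smaller than M.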
The specified integer value M may be pre-agreed upon by the tag holder 20 and the feature holder 10.
In this embodiment, the feature holder superimposes the encrypted random number on the first encrypted tag value. The tag holder can still decrypt the original tag value from the second encrypted tag value, but it cannot learn any pattern in the private data of the feature holder, and any special mark the tag holder may have added is eliminated. This improves the privacy and security of the feature holder's data.
In another embodiment of the present specification, when the tag holder 20 does not receive positions of equal feature values in the update order from the feature holder 10, the N feature values are considered to contain no equal feature values. When performing step S260, that is, when performing the adjacent-bin merging operation based on at least the N original tag values arranged in the update order to obtain the first binning result, the tag holder 20 may follow the flowchart shown in fig. 4, which includes the following steps S261 to S264.
Step S261, using each position corresponding to the N original tag values arranged in the update order as an initial bin, to obtain N initial bins. This embodiment may correspond to a case where N feature values in the feature holder 10 are different from each other.
Step S262, based on the original tag values in the initial bins, performing adjacent bin merging operation on the initial bins to obtain updated bin results, where the updated bin results show the updated bins corresponding to each position in the update sequence. The updated binning result may also be represented in the manner given in step S230, and the representation thereof is not described herein again.
Step S263, when the updated bins do not meet the preset binning condition, taking the updated bins as the initial bins and returning to step S262.
Step S264, when the updated bins meet the preset binning condition, determining the updated binning result as the first binning result.
In this embodiment, the adjacent-bin merging operation performed on the initial bins based on the original tag values in them may use chi-square binning or another merging-based binning method. Chi-square binning is described in detail below.
When adjacent bins are merged by chi-square binning, the chi-square value of each pair of adjacent initial bins may be determined in turn based on the original tag values in the initial bins, giving a plurality of chi-square values, and the pair of adjacent initial bins with the smallest chi-square value is merged. In the first iteration, the initial bins are the N initial bins corresponding to the N original tag values in the update order. In the second and subsequent iterations, the initial bins are the updated bins of the previous iteration.
For example, for initial bins 1, 2, 3, ..., 7 arranged in sequence, the chi-square values of initial bins 1 and 2, of initial bins 2 and 3, of initial bins 3 and 4, of initial bins 4 and 5, of initial bins 5 and 6, and of initial bins 6 and 7 may be determined in turn, giving 6 chi-square values. The adjacent initial bins with the smallest of the 6 chi-square values are merged; assuming the chi-square value of initial bins 1 and 2 is the smallest, initial bins 1 and 2 are merged in this iteration.
In determining the chi-squared value for each pair of adjacent initial bins, the following may be used. Taking an example that the original tag value includes two values (an original tag value 1 and an original tag value 2, for example, 0 and 1, respectively, that is, two classes), for each pair of adjacent initial bins, for example, the initial bin 1 and the initial bin 2, each parameter in table 3 is counted.
TABLE 3
Bin | Original tag value 1 | Original tag value 2 | Row total
Initial bin 1 | A11 | A12 | R1
Initial bin 2 | A21 | A22 | R2
Column total | C1 | C2 |
Here A11 denotes the number of samples in initial bin 1 with original tag value 1, A12 the number of samples in initial bin 1 with original tag value 2, A21 the number of samples in initial bin 2 with original tag value 1, and A22 the number of samples in initial bin 2 with original tag value 2. R1 denotes the number of samples in initial bin 1, R1 = A11 + A12, and R2 denotes the number of samples in initial bin 2, R2 = A21 + A22. C1 denotes the number of samples with original tag value 1 in the two bins, C1 = A11 + A21, and C2 denotes the number of samples with original tag value 2 in the two bins, C2 = A12 + A22. A11, A21, A12 and A22 are the actual frequencies.
Then, the expected frequencies E11, E12, E21 and E22 are determined from the parameters in table 3, as shown in table 4.
TABLE 4
Bin | Original tag value 1 | Original tag value 2
Initial bin 1 | E11 = (R1/N)*C1 | E12 = (R1/N)*C2
Initial bin 2 | E21 = (R2/N)*C1 | E22 = (R2/N)*C2
Here N is the total number of samples in the two bins being compared, that is, N = R1 + R2 = C1 + C2. Using the data in tables 3 and 4, the chi-square value of initial bin 1 and initial bin 2 is calculated with the following equation (1):

$$\chi^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{(A_{ij} - E_{ij})^2}{E_{ij}} \qquad (1)$$

where m is the number of bins over which the chi-square value is computed (here 2), n is the number of kinds of original tag values (2 in the above example), and $\chi^2$ is the chi-square value. The calculation of the chi-square value when n takes a larger value follows from equation (1) in the same way.
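The computation in tables 3 and 4 and equation (1) can be sketched as follows. The sketch assumes that N in table 4 is the number of samples in the two bins being compared, and it skips cells whose expected frequency is zero to avoid division by zero; both choices, like the function name, are assumptions of this illustration.

```python
def chi_square_pair(bin1_labels, bin2_labels, label_values=(0, 1)):
    """Chi-square value of two adjacent bins of original tag values."""
    bins = [bin1_labels, bin2_labels]
    # Actual frequencies A[i][j]: samples in bin i with tag value j (table 3).
    A = [[labels.count(v) for v in label_values] for labels in bins]
    R = [sum(row) for row in A]                 # row totals R1, R2
    C = [sum(col) for col in zip(*A)]           # column totals C1, C2
    total = sum(R)                              # samples in the two bins
    chi2 = 0.0
    for i in range(len(bins)):
        for j in range(len(label_values)):
            expected = R[i] * C[j] / total      # E_ij = (R_i / N) * C_j (table 4)
            if expected > 0:                    # skip empty cells
                chi2 += (A[i][j] - expected) ** 2 / expected
    return chi2

print(chi_square_pair([0, 0, 0, 1], [1, 1, 1, 0]))   # 2.0
print(chi_square_pair([0, 1], [0, 1]))               # 0.0
```

In the first call every expected frequency is 2 and every actual frequency is off by 1, giving 4 × 0.5 = 2.0; in the second call the two bins have identical tag distributions, so the chi-square value is 0 and the pair is a natural candidate for merging.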
In this way, the chi-square value of each pair of adjacent initial bins can be obtained and the binning result updated. It can then be judged whether the updated binning result meets the preset binning condition. If not, step S263 is executed and the iteration continues; if so, step S264 is executed, the iteration ends, and the first binning result is determined.
The preset binning condition may be that the total number of updated bins reaches a preset number. The preset number may be determined from empirical values and may be a specific value, for example a value between 5 and 8 or another value greater than 1, or a range such as [5,8]. When the preset number is a range, judging whether the total number of updated bins reaches the preset number means judging whether that total lies within the range.
In step S262, in any iteration, the total number of updated bins may be directly determined, and it is determined whether the total number reaches a preset number.
Alternatively, when the adjacent-bin merging operation uses chi-square binning, the preset binning condition may be that the chi-square value of every pair of adjacent updated bins is greater than a preset threshold. The preset threshold may be determined from empirical values.
In step S262, in any iteration after the first iteration, after the updated binning result is obtained, the chi-square value of each pair of adjacent updated bins may be calculated in the current iteration and compared with the preset threshold. If any of these chi-square values is smaller than the preset threshold, the updated bins are taken as the initial bins and the next iteration begins; otherwise, the updated binning result is determined as the first binning result and the iteration ends.
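The iteration of steps S261 to S264 with either stopping condition can be sketched as below. The function names, the `max_bins` and `threshold` parameters, and the tag values are hypothetical; the sketch again assumes that expected frequencies use the sample count of the two bins being compared.

```python
def chi_square_pair(a, b, label_values):
    """Chi-square value of two adjacent bins (equation (1) with m = 2)."""
    A = [[labels.count(v) for v in label_values] for labels in (a, b)]
    R = [sum(row) for row in A]
    C = [sum(col) for col in zip(*A)]
    total = sum(R)
    chi2 = 0.0
    for i in range(2):
        for j in range(len(label_values)):
            expected = R[i] * C[j] / total
            if expected > 0:
                chi2 += (A[i][j] - expected) ** 2 / expected
    return chi2

def chi_merge(initial_bins, label_values=(0, 1), max_bins=None, threshold=None):
    """Merge the adjacent pair with the smallest chi-square until a condition holds."""
    bins = [list(b) for b in initial_bins]
    while len(bins) > 1:
        chi_values = [chi_square_pair(bins[i], bins[i + 1], label_values)
                      for i in range(len(bins) - 1)]
        count_ok = max_bins is not None and len(bins) <= max_bins
        chi_ok = threshold is not None and min(chi_values) > threshold
        if count_ok or chi_ok:
            break                                   # preset binning condition met
        i = chi_values.index(min(chi_values))       # first pair with the smallest value
        bins[i:i + 2] = [bins[i] + bins[i + 1]]     # merge the two adjacent bins
    return bins

tags_in_update_order = [0, 0, 0, 1, 1, 1, 0, 0]    # hypothetical decrypted tag values
initial = [[t] for t in tags_in_update_order]      # step S261: one bin per position
by_count = chi_merge(initial, max_bins=3)
by_threshold = chi_merge(initial, threshold=1.0)
```

With this toy input both stopping conditions lead to the same three bins, [0, 0, 0], [1, 1, 1] and [0, 0]. If neither `max_bins` nor `threshold` is given, the loop simply merges everything into a single bin.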
In another embodiment, the process may be a variant of chi-square binning. For example, based on the original tag values in the initial bins, one chi-square value is calculated in turn for each group of three adjacent initial bins, giving a plurality of chi-square values, and the three adjacent initial bins with the smallest chi-square value are merged. The three adjacent initial bins may of course be replaced by four adjacent initial bins, five adjacent initial bins, and so on. Different merging granularities give different binning precision and different computational efficiency: the smaller the granularity, the higher the precision of the merged bins, but the larger the amount of computation and the lower the efficiency.
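One iteration of this variant can be sketched by extending the chi-square value to a window of several adjacent bins, that is, equation (1) with m equal to the window size. The names `chi_square` and `merge_best_window` and the tag values are hypothetical, and the sketch assumes there are at least as many bins as the window size.

```python
def chi_square(bins, label_values=(0, 1)):
    """Chi-square value over any number of adjacent bins."""
    A = [[labels.count(v) for v in label_values] for labels in bins]
    R = [sum(row) for row in A]
    C = [sum(col) for col in zip(*A)]
    total = sum(R)
    chi2 = 0.0
    for i, row in enumerate(A):
        for j, actual in enumerate(row):
            expected = R[i] * C[j] / total
            if expected > 0:                 # skip empty cells
                chi2 += (actual - expected) ** 2 / expected
    return chi2

def merge_best_window(bins, window=3, label_values=(0, 1)):
    """Merge the run of `window` adjacent bins with the smallest chi-square value."""
    scores = [chi_square(bins[i:i + window], label_values)
              for i in range(len(bins) - window + 1)]
    start = scores.index(min(scores))
    merged = [label for b in bins[start:start + window] for label in b]
    return bins[:start] + [merged] + bins[start + window:]

bins = [[0], [1], [1], [1], [0], [0]]        # hypothetical initial bins
updated = merge_best_window(bins, window=3)
```

Here the three bins that all contain tag value 1 have a chi-square value of 0 and are merged in one step, where pairwise merging would need two iterations.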
In this embodiment, the tag holder can perform binning locally based on the second sequence, so that supervised binning based on tags can be smoothly implemented.
In another embodiment of this specification, the tag holder 20 receives, in addition to the second sequence sent by the feature holder 10, the positions of equal feature values in the update order. When the tag holder 20 executes step S260, that is, when it performs the adjacent-bin merging operation based on at least the N original tag values arranged in the update order to obtain the first binning result, it may proceed as follows, in steps 1a to 4a.
Step 1a, determining the initial bins corresponding to the N original tag values arranged in the update order based on the positions of equal feature values in the update order. This embodiment corresponds to the case where the N feature values held by the feature holder 10 include equal feature values.
Specifically, in this step, based on the positions of the equal feature values in the update sequence, for N original tag values arranged according to the update sequence, the original tag values at the positions of different feature values are divided into different initial bins, and the original tag values at the positions of the same feature values are divided into the same initial bins.
This step is explained, for example, with reference to the example shown in fig. 3. For 16 positions corresponding to the 16 samples in fig. 3, the 1 st position to the 16 th position are respectively arranged in a left-to-right order. The tag holder 20 has received the positions of the equivalent feature values in the update sequence, that is, has received information indicating: in fig. 3, the 1 st and 2 nd positions are equal eigenvalues, the 4 th to 7 th positions are equal eigenvalues, the 8 th and 9 th positions are equal eigenvalues, the 10 th to 12 th positions are equal eigenvalues, the 13 th and 14 th positions are equal eigenvalues, and the 15 th and 16 th positions are equal eigenvalues.
Each position in the update sequence is in one-to-one correspondence with the positions of the N original tag values arranged according to the update sequence. Accordingly, for N original tag values arranged in the update order, the original tag values at the 1 st and 2 nd positions may be divided into the initial bin 1, the original tag value at the 3 rd position may be divided into the initial bin 2, the original tag values at the 4 th to 7 th positions may be divided into the initial bin 3, the original tag values at the 8 th and 9 th positions may be divided into the initial bin 4, the original tag values at the 10 th to 12 th positions may be divided into the initial bin 5, the original tag values at the 13 th and 14 th positions may be divided into the initial bin 6, and the original tag values at the 15 th and 16 th positions may be divided into the initial bin 7. The initial binning determined in this way can achieve that subsequent binning results do not divide equal feature values into different bins.
Step 2a, performing the adjacent-bin merging operation on the initial bins based on the original tag values in them to obtain an updated binning result, which shows the updated bin corresponding to each position in the update order.
Step 3a, when the updated bins do not meet the preset binning condition, taking the updated bins as the initial bins and returning to step 2a.
Step 4a, when the updated bins meet the preset binning condition, determining the updated binning result as the first binning result.
In each step of this embodiment, except for step 1a, other steps, for example, steps 2a to 4a, are completely the same as the embodiment shown in fig. 4, and specific description may refer to the description of steps S262 to S264 corresponding to the embodiment shown in fig. 4, and are not repeated here.
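Step 1a can be sketched as follows, assuming the positions of equal feature values arrive in the spacer form of the fig. 3 example. The tag values are hypothetical, and `initial_bins` is an illustrative name.

```python
def initial_bins(original_tags, spacer_form, spacer="-"):
    """Put tag values at positions with equal feature values into one initial bin."""
    sizes = [len(part) for part in spacer_form.split(spacer)]
    if sum(sizes) != len(original_tags):
        raise ValueError("number of positions must match number of tag values")
    bins, start = [], 0
    for size in sizes:
        bins.append(original_tags[start:start + size])
        start += size
    return bins

equal_positions = "00-0-0000-00-000-00-00"   # 16 positions, 7 groups
original_tags = [0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1]   # hypothetical
bins = initial_bins(original_tags, equal_positions)
```

The resulting 7 initial bins can then go through the same merging iteration as in steps 2a to 4a; because merging only ever joins whole bins, equal feature values can never end up in different bins.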
The embodiments described above provide a merging-based method for feature binning. Based on the same inventive concept, for the application scenario in which the features and the labels of the samples are held by different owners, each owner requires privacy protection for its own data, and no owner's data may be output in plaintext, the embodiments of this specification further provide another two-party joint binning method, which is based on split-based binning. In this embodiment, the tag holder 20 homomorphically encrypts the original tag values to obtain first encrypted tag values and sends them to the feature holder 10; the feature holder 10 reorders the feature values, obtains the second encrypted tag values in the update order according to the association relationship between the feature values and the received first encrypted tag values, and sends the second encrypted tag values in the update order to the tag holder 20; the tag holder 20 performs a bin splitting operation based on the received information to obtain a first binning result and sends the first binning result to the feature holder 10; the feature holder 10 bins the feature values in the update order based on the first binning result to obtain a feature binning result. The whole interaction thus involves no plaintext data transmission while achieving supervised binning of the features, and the use of homomorphic encryption protects the privacy and security of the private data as far as possible. The specific process is shown in the embodiment of fig. 5.
Fig. 5 is a supervised feature binning method based on privacy protection according to an embodiment, which is performed by a feature holder 10, where the feature holder stores feature values of a first feature of N samples, original tag values of the N samples are stored in a tag holder 20, and the N samples are arranged in a predetermined order. The method includes the following steps S510-S580.
In step S510, the tag holder 20 uses a public key to homomorphically encrypt the N original tag values into corresponding first encrypted tag values and sends the N first encrypted tag values, arranged in the predetermined order, to the feature holder 10; the feature holder 10 receives the N first encrypted tag values arranged in the predetermined order.
In step S520, the feature holder 10 associates the N first encrypted tag values with the N feature values of the first feature, respectively, based on the predetermined sequence, to obtain an association relationship.
In step S530, the feature holder 10 reorders the N feature values by value to obtain a first sequence composed of the N feature values arranged in an update order and, based on the association relationship, processes the N first encrypted tag values to obtain a second sequence composed of N second encrypted tag values arranged in the update order.
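Steps S520 and S530 can be illustrated with a minimal sketch, given here only as an assumption-laden example and not as the implementation of this embodiment. The function name `reorder_with_association` is hypothetical, the ciphertexts are opaque placeholder strings, and the second encrypted tag values are taken to be equal to the first encrypted tag values (the case mentioned for step S550). The feature values are those of the purchase-amount example used later in this description.

```python
def reorder_with_association(feature_values, encrypted_labels):
    """Sort feature values ascending and carry each ciphertext along with its feature value."""
    if len(feature_values) != len(encrypted_labels):
        raise ValueError("both inputs must hold one entry per sample")
    # S520: the association is the shared position in the predetermined order.
    associated = list(zip(feature_values, encrypted_labels))
    # S530: sorting by feature value defines the update order (the sort is stable).
    associated.sort(key=lambda pair: pair[0])
    first_sequence = [value for value, _ in associated]
    second_sequence = [cipher for _, cipher in associated]
    return first_sequence, second_sequence


# Samples 1..5 in the predetermined order; ciphertexts are placeholders, not real encryptions.
features = [24, 13, 11, 25, 23]
ciphertexts = ["E(y1)", "E(y2)", "E(y3)", "E(y4)", "E(y5)"]
first_seq, second_seq = reorder_with_association(features, ciphertexts)
print(first_seq)
print(second_seq)
```

The feature holder never needs to decrypt anything here: it only permutes ciphertexts, so the tag holder later sees labels in feature-sorted order without seeing the feature values themselves.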
In step S540, the feature holder 10 at least transmits the second sequence to the tag holder 20, and the tag holder 20 receives the second sequence transmitted by the feature holder 10.
The feature holder 10 may further determine whether any of the N feature values are equal. If not, it sends the second sequence directly to the tag holder 20. If equal feature values exist, it may determine the positions of the equal feature values in the update order based on the N feature values in the first sequence, and send both the second sequence and those positions to the tag holder 20. When feature values are binned, equal feature values must fall into the same bin and must not be split across different bins. This is why, when equal feature values exist among the N feature values, the feature holder 10 also sends their positions in the update order to the tag holder 20.
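As a hedged illustration of how the positions of equal feature values might be derived from the first sequence, the sketch below reports each run of equal values as a pair of zero-based positions and lists the split points that such runs rule out. The function names and the (start, end) encoding are assumptions made for this example; the specification leaves the concrete representation open.

```python
def equal_value_runs(first_sequence):
    """Return (start, end) position pairs, in update order, of runs holding equal feature values."""
    runs = []
    start = 0
    for i in range(1, len(first_sequence) + 1):
        # A run ends at the end of the sequence or when the value changes.
        if i == len(first_sequence) or first_sequence[i] != first_sequence[start]:
            if i - start > 1:  # only runs of two or more positions are "equal feature values"
                runs.append((start, i - 1))
            start = i
    return runs


def forbidden_split_points(runs):
    """Split point k lies between positions k and k+1; points inside a run are not allowed."""
    return {k for start, end in runs for k in range(start, end)}


sorted_values = [1, 2, 2, 3, 3, 3, 4]  # hypothetical first sequence
runs = equal_value_runs(sorted_values)
blocked = forbidden_split_points(runs)
print(runs, blocked)
```

Only the positions of the runs are shared, not the feature values, so the tag holder learns which neighbours are tied but not what value they hold.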
In step S550, the tag holder 20 decrypts the N second encrypted tag values in the second sequence into corresponding original tag values by using the private key corresponding to the public key, so as to obtain N original tag values arranged in the update order. When the second encrypted tag value is equal to the corresponding first encrypted tag value, the second encrypted tag value may be directly decrypted by using a private key to obtain a corresponding original tag value.
The specific implementation of the steps S510 to S550 can be the same as that described in the steps S210 to S250, and for the specific description, reference can be made to the steps S210 to S250, which is not described herein again.
In step S560, the tag holder 20 performs a split binning operation based at least on the N original tag values arranged in the update order to obtain a first binning result, where the first binning result indicates the first bin corresponding to each position in the update order.
In step S570, the tag holder 20 sends the first binning result to the feature holder 10, and the feature holder 10 receives the first binning result, which indicates the first bin corresponding to each position in the update order.
When the tag holder 20 performs the split binning operation based on the N original tag values arranged in the update order, it may use Best-KS binning or another split-based binning method, for example a binning method based on minimum entropy.
Best-KS (Kolmogorov-Smirnov) binning is a top-down (split-based) data discretization method. The KS statistic can be used to evaluate a model's ability to discriminate risk; it describes the difference between the cumulative distributions of samples with different labels when the feature data falls into different intervals.
The first binning result may be represented in the manner given in step S270, and for a specific implementation, reference may be made to the description in step S270, which is not described herein again.
In step S580, the feature holder 10 bins the feature values at the respective positions in the first sequence according to the first binning result to obtain a feature binning result. For a detailed description of this step, refer to step S280, which is not repeated here.
As can be seen from the above, in this embodiment the tag holder interacts with the feature holder by means of homomorphic encryption and decryption, performs the split binning operation on the tag values recovered from the second encrypted tag values, and sends the resulting first binning result to the feature holder. No raw feature or tag data is sent in plaintext during the whole interaction, supervised binning of the feature is achieved, and homomorphic encryption protects the privacy and security of the private data as far as possible.
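The following sketch shows one way step S580 could look. It assumes, purely for illustration, that the first binning result is a list giving a bin index for each position in the update order; the actual representation is the one described for step S270. The function name `apply_binning` and the sample data are hypothetical.

```python
def apply_binning(first_sequence, bin_per_position):
    """Assign each feature value the bin of its position and derive each bin's value range."""
    if len(first_sequence) != len(bin_per_position):
        raise ValueError("the binning result must cover every position in the update order")
    ranges = {}
    for value, bin_id in zip(first_sequence, bin_per_position):
        low, high = ranges.get(bin_id, (value, value))
        ranges[bin_id] = (min(low, value), max(high, value))
    assignment = list(zip(first_sequence, bin_per_position))
    return assignment, ranges


first_seq = [11, 13, 23, 24, 25]   # feature values in the update order
first_binning = [0, 0, 0, 1, 1]    # hypothetical first binning result from the tag holder
assignment, ranges = apply_binning(first_seq, first_binning)
print(assignment)
print(ranges)
```

Because the feature holder knows the plaintext feature values, it can turn the position-based result into value ranges locally, without sending anything further to the tag holder.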
In another embodiment of the present specification, when the N original tag values are integers, the feature holder 10 in step S530 may obtain the second sequence of N second encrypted tag values arranged in the update order, based on the association relationship, by performing the following steps:
for each of the N first encrypted tag values, generating a corresponding random number p; multiplying the random number p by a specified integer value M to obtain a transformed random number pM; homomorphically encrypting the transformed random number pM into an encrypted random number E(pM) using the public key Key1; homomorphically adding the encrypted random number E(pM) to the first encrypted tag value E(y) to obtain a second encrypted tag value E(pM + y); and determining, based on the association relationship, the second sequence composed of the N second encrypted tag values arranged in the update order. The specified integer value M is greater than the maximum of the N original tag values. For other details, refer to the preceding embodiments.
After such processing is performed on the first encrypted tag value, the second sequence transmitted by the feature holder 10 in step S540 is composed of N second encrypted tag values arranged in the update order. The tag holder 20 may receive a second sequence of N second encrypted tag values arranged in the update order.
In step S550, the tag holder 20 may decrypt the N second encrypted tag values in the second sequence into corresponding first values using the private key Key2 corresponding to the public key Key1, and take the remainder of each first value divided by the specified integer value M to obtain the corresponding original tag value.
In this embodiment, the feature holder homomorphically adds an encrypted random number to each first encrypted tag value. The tag holder can still recover the original tag value from the second encrypted tag value, but it cannot learn any pattern in the feature holder's private data, and any special marker that the tag holder might have embedded in a first encrypted tag value is eliminated. This improves the privacy and security of the feature holder's data.
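The blinding and recovery described above can be sketched with a toy additively homomorphic scheme. The specification does not name a particular homomorphic encryption algorithm; the example below assumes Paillier encryption with tiny fixed primes and a seeded non-cryptographic random generator, so it is illustrative only and not secure. All function and variable names are hypothetical.

```python
import math
import random

# Toy Paillier key (assumption: Paillier is used; primes are far too small for real use).
PRIME_P, PRIME_Q = 10007, 10009
N = PRIME_P * PRIME_Q            # public modulus (plays the role of Key1)
N2 = N * N
PHI = (PRIME_P - 1) * (PRIME_Q - 1)   # private (plays the role of Key2)
MU = pow(PHI, -1, N)             # modular inverse, requires Python 3.8+


def encrypt(m, rng):
    """Paillier encryption with generator N + 1: c = (1 + m*N) * r^N mod N^2."""
    while True:
        r = rng.randrange(1, N)
        if math.gcd(r, N) == 1:
            break
    return ((1 + m * N) * pow(r, N, N2)) % N2


def decrypt(c):
    """Recover m from c using the private values PHI and MU."""
    return ((pow(c, PHI, N2) - 1) // N * MU) % N


def add(c1, c2):
    """Homomorphic addition: multiplying ciphertexts adds the underlying plaintexts."""
    return (c1 * c2) % N2


def blind(first_encrypted, M, rng):
    """Feature holder: turn E(y) into E(p*M + y) using a fresh random number p."""
    p = rng.randrange(1, N // M - 1)   # keeps p*M + y below N so decryption is exact
    return add(encrypt(p * M, rng), first_encrypted)


rng = random.Random(0)   # seeded for reproducibility; a real system needs a secure RNG
M = 2                    # specified integer value, greater than the largest label (labels are 0/1)
labels = [1, 0, 0, 1, 1]
first = [encrypt(y, rng) for y in labels]        # tag holder, step S510
second = [blind(c, M, rng) for c in first]       # feature holder, step S530
recovered = [decrypt(c) % M for c in second]     # tag holder, step S550
print(recovered)
```

Since each second encrypted tag value is a fresh ciphertext of pM + y, it bears no visible relation to the first encrypted tag value it came from, while the remainder modulo M still yields y because y < M.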
In another embodiment of the present specification, when the tag holder 20 does not receive positions of equal feature values in the update order from the feature holder 10, it is considered that no equal feature values exist among the N feature values. In this case, when the tag holder 20 performs the split binning operation based at least on the N original label values arranged in the update order to obtain the first binning result, step S560 may be performed according to the following iterative procedure, which includes steps 1b to 4b.
Step 1b: take the N original label values arranged in the update order as one initial bin.
Step 2b: for any initial bin, determine a splitting point of the initial bin based on the original label values in the initial bin, and split the initial bin at the splitting point to obtain an updated binning result, where the updated binning result indicates the updated bin corresponding to each position in the update order.
Step 3b: when the updated bins do not satisfy a preset binning condition, take the updated bins as initial bins and return to step 2b.
Step 4b: when the updated bins satisfy the preset binning condition, determine the updated binning result as the first binning result.
Initially there is one initial bin; after the first iteration it is split into 2 updated bins. In the second iteration there are 2 initial bins, which are split into 4 updated bins, and the binning process continues in this way until the updated bins satisfy the preset binning condition. The preset binning condition includes: the total number of updated bins reaches a preset number.
Fig. 6 is a schematic diagram of the split-based binning process. In the first iteration, all the feature values of the revenue feature form one initial bin; splitting point 1 is determined for this initial bin, and the initial bin is split at splitting point 1 to obtain bin 1 and bin 2. In the second iteration, bin 1 and bin 2 are each taken as an initial bin, and splitting point 2 of bin 1 and splitting point 3 of bin 2 are determined. Splitting point 2 is the point between sample 5 and sample 8, and splitting point 3 is the point between sample 15 and sample 12. Bin 1 and bin 2 are split at splitting point 2 and splitting point 3 respectively to obtain bins 3 to 6. The iteration stops when the total number of bins reaches the preset number.
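The iteration of steps 1b to 4b can be sketched as a loop over half-open index ranges in the update order. This is a minimal illustration under stated assumptions: the splitting rule is passed in as a function, and the `midpoint_split` rule used here is only a placeholder standing in for Best-KS or another split criterion. The names `split_binning`, `midpoint_split` and `bin_per_position` are hypothetical.

```python
def split_binning(labels, choose_split, preset_bins):
    """Steps 1b-4b: start from one bin and split every bin each round until preset_bins is reached."""
    bins = [(0, len(labels))]                  # step 1b: one initial bin covering all positions
    while len(bins) < preset_bins:             # steps 3b/4b: check the preset binning condition
        updated = []
        for start, end in bins:                # step 2b: split each initial bin at its split point
            cut = choose_split(labels, start, end)
            if cut is None:
                updated.append((start, end))   # this bin cannot be split further
            else:
                updated.extend([(start, cut), (cut, end)])
        if len(updated) == len(bins):
            break                              # no bin could be split; stop to avoid looping forever
        bins = updated
    return bins


def midpoint_split(labels, start, end):
    """Placeholder rule (an assumption): cut the bin in half. Best-KS would be used here instead."""
    return None if end - start < 2 else (start + end) // 2


def bin_per_position(bins, n):
    """Express the result as the bin index of each position in the update order."""
    result = [0] * n
    for bin_id, (start, end) in enumerate(bins):
        for pos in range(start, end):
            result[pos] = bin_id
    return result


labels = [0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1]  # hypothetical labels for 16 positions
bins = split_binning(labels, midpoint_split, preset_bins=4)
print(bins)
print(bin_per_position(bins, len(labels)))
```

Because every bin is split in each round, the number of bins doubles per iteration and may overshoot a preset number that is not a power of two; selecting only some of the updated bins for further splitting is one way to avoid this.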
In this embodiment, when the splitting point of an initial bin is determined in step 2b based on the original label values in the initial bin, Best-KS binning or another split-based binning method may be used. Best-KS binning is taken as an example below.
To determine the splitting point of an initial bin, the point between each pair of adjacent original label values in the initial bin is taken as a candidate splitting point, which divides the initial bin into two sub-bins; based on the original label values in the two sub-bins, the KS sum value of the two sub-bins is determined using the Best-KS algorithm and taken as the feature discrimination corresponding to the candidate splitting point. Among the candidate splitting points, the one with the maximum feature discrimination is selected as the splitting point of the initial bin.
When there are multiple initial bins, the splitting point of each initial bin is determined by the above operation.
For example, suppose the 5 feature values of a company's purchase amount are all different and, from left to right in the update order, are: sample 3 (11), sample 2 (13), sample 5 (23), sample 1 (24), and sample 4 (25). All 5 feature values of the purchase amount are taken as the initial bin. In the first iteration, the point between each pair of adjacent original label values in the initial bin is taken as a candidate splitting point: from left to right, the point between sample 3 and sample 2 is one candidate splitting point, the point between sample 2 and sample 5 is another, and so on, giving 4 candidate splitting points in total. Each candidate splitting point divides the initial bin into two sub-bins. For example, the candidate splitting point between sample 5 and sample 1 divides the initial bin into a left sub-bin 1 consisting of feature values 11, 13 and 23 and a right sub-bin 2 consisting of feature values 24 and 25. To determine the KS sum value of sub-bin 1 and sub-bin 2 from the original label values they contain, the parameters n11, n12, n21 and n22 are counted. The KS value of sub-bin 1 is calculated as |n12/n2 - n11/n1|, the KS value of sub-bin 2 is calculated as |n22/n2 - n21/n1|, and the two KS values are added to obtain the KS sum value, where | | denotes the absolute value.
Here n11 is the number of samples with a label value of 1 in sub-bin 1, n12 is the number of samples with a label value of 0 in sub-bin 1, n21 is the number of samples with a label value of 1 in sub-bin 2, n22 is the number of samples with a label value of 0 in sub-bin 2, n1 is the total number of samples with a label value of 1 in the initial bin, and n2 is the total number of samples with a label value of 0 in the initial bin. The above example uses the two-class case, in which the label values are 0 and 1; implementations for three, four or more classes can be derived from this description. The above shows how the feature discrimination of one candidate splitting point is determined; the feature discrimination of the other 3 candidate splitting points is obtained in the same way. Each candidate splitting point corresponds to one way of splitting, so selecting the candidate splitting point with the maximum feature discrimination amounts to selecting one way of splitting the initial bin.
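The KS sum value of a candidate splitting point can be computed directly from the counts n11, n12, n21, n22, n1 and n2 defined above. The sketch below does so for the 5-sample purchase-amount example; the label values are hypothetical, since the example does not give them, and the function names are illustrative. Exact fractions are used so the scores can be compared without rounding effects.

```python
from fractions import Fraction


def ks_sum(labels, start, cut, end):
    """KS sum of the sub-bins labels[start:cut] and labels[cut:end] for 0/1 labels."""
    n1 = sum(labels[start:end])            # label-1 samples in the initial bin
    n2 = (end - start) - n1                # label-0 samples in the initial bin
    if n1 == 0 or n2 == 0:
        return None                        # undefined when the bin holds a single class
    n11 = sum(labels[start:cut])           # label-1 samples in sub-bin 1
    n12 = (cut - start) - n11              # label-0 samples in sub-bin 1
    n21 = n1 - n11                         # label-1 samples in sub-bin 2
    n22 = n2 - n12                         # label-0 samples in sub-bin 2
    ks1 = abs(Fraction(n12, n2) - Fraction(n11, n1))
    ks2 = abs(Fraction(n22, n2) - Fraction(n21, n1))
    return ks1 + ks2


def best_ks_split(labels, start, end):
    """Return (cut, score) for the candidate splitting point with the largest KS sum."""
    best_cut, best_score = None, None
    for cut in range(start + 1, end):      # one candidate between each adjacent pair
        score = ks_sum(labels, start, cut, end)
        if score is not None and (best_score is None or score > best_score):
            best_cut, best_score = cut, score
    return best_cut, best_score


# Update order: sample 3, sample 2, sample 5, sample 1, sample 4 (labels are assumed values).
labels = [0, 1, 0, 1, 1]
scores = [ks_sum(labels, 0, cut, 5) for cut in range(1, 5)]
cut, score = best_ks_split(labels, 0, 5)
print(scores, cut, score)
```

With these assumed labels, the selected splitting point is the one between sample 5 and sample 1. Note that n21 = n1 - n11 and n22 = n2 - n12, so the two KS values are always equal and the KS sum ranks candidates exactly as a single KS value would.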
The embodiment shown in steps 1b to 4b is only one implementation of step S560. In another embodiment, step 3b may be modified: for example, when the updated bins do not satisfy the preset binning condition, some of the updated bins are selected as initial bins, and the procedure returns to step 2b.
In another implementation of this embodiment, the tag holder 20 receives, in addition to the second sequence sent by the feature holder 10, the positions of equal feature values in the update order sent by the feature holder 10. When the tag holder 20 performs the split binning operation based at least on the N original label values arranged in the update order to obtain the first binning result, step S560 may be performed according to the following iterative procedure, which includes steps 1c to 4c.
Step 1c: take the N original label values arranged in the update order as one initial bin.
Step 2c: for any initial bin, determine, based on the positions of the equal feature values in the update order and the original label values in the initial bin, a splitting point of the initial bin that does not lie between equal feature values, and split the initial bin at the splitting point to obtain an updated binning result, where the updated binning result indicates the updated bin corresponding to each position in the update order.
Step 3c: when the updated bins do not satisfy a preset binning condition, take the updated bins as initial bins and return to step 2c.
Step 4c: when the updated bins satisfy the preset binning condition, determine the updated binning result as the first binning result.
Initially there is one initial bin; after the first iteration it is split into 2 updated bins. In the second iteration there are 2 initial bins, which are split into 4 updated bins, and the binning process continues in this way until the updated bins satisfy the preset binning condition. The preset binning condition includes: the total number of updated bins reaches a preset number.
In this embodiment, when the splitting point that does not lie between equal feature values is determined in step 2c based on the positions of the equal feature values in the update order and the original label values in the initial bin, Best-KS binning or another split-based binning method may be used. Best-KS binning is taken as an example below.
For any initial bin, based on the positions of the equal feature values in the update order, the point between each pair of adjacent original label values is taken as a candidate splitting point, excluding points that lie between equal feature values. Each candidate splitting point divides the initial bin into two sub-bins; based on the original label values in the two sub-bins, the KS sum value of the two sub-bins is determined using the Best-KS algorithm and taken as the feature discrimination of the candidate splitting point. Among the candidate splitting points, the one with the maximum feature discrimination is selected as the splitting point of the initial bin.
When there are multiple initial bins, the splitting point of each initial bin is determined by the above operation.
For example, taking the revenue feature shown in fig. 6, there are 16 positions from left to right, one for each of the 16 samples, and several groups of equal feature values can be seen among them. Initially, the positions of the 16 feature values as a whole are taken as one initial bin. In the first iteration, the point between sample 7 and sample 5, the point between sample 5 and sample 8, the point between sample 9 and sample 3, the point between sample 11 and sample 10, the point between sample 15 and sample 12, and the point between sample 13 and sample 16 are taken as the candidate splitting points. Each candidate splitting point divides the initial bin into two sub-bins. To determine the KS sum value of the two sub-bins 1 and 2 corresponding to any candidate splitting point from the original label values they contain, the parameters n11, n12, n21 and n22 are counted. The KS value of sub-bin 1 is calculated as |n12/n2 - n11/n1|, the KS value of sub-bin 2 is calculated as |n22/n2 - n21/n1|, and the two KS values are added to obtain the KS sum value, where | | denotes the absolute value.
Here n11 is the number of samples with a label value of 1 in sub-bin 1, n12 is the number of samples with a label value of 0 in sub-bin 1, n21 is the number of samples with a label value of 1 in sub-bin 2, n22 is the number of samples with a label value of 0 in sub-bin 2, n1 is the total number of samples with a label value of 1 in the initial bin, and n2 is the total number of samples with a label value of 0 in the initial bin. The above example uses the two-class case; implementations for three, four or more classes can be derived from this description. The above shows how the feature discrimination of one candidate splitting point is determined; the feature discrimination of the other 5 candidate splitting points is obtained in the same way. Each candidate splitting point corresponds to one way of splitting, so selecting the candidate splitting point with the maximum feature discrimination amounts to selecting one way of splitting the initial bin.
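The restriction of step 2c can be sketched by filtering the candidate splitting points before scoring them. The example assumes, for illustration only, that the tag holder has turned the received positions into a list `same_as_next`, where entry k is true when positions k and k+1 hold equal feature values. The labels, the tie pattern and all names are hypothetical.

```python
from fractions import Fraction


def allowed_cuts(same_as_next, start, end):
    """Candidate cuts inside labels[start:end] that do not separate equal feature values."""
    return [cut for cut in range(start + 1, end) if not same_as_next[cut - 1]]


def ks_sum(labels, start, cut, end):
    """KS sum of the two sub-bins, using the formulas |n12/n2 - n11/n1| and |n22/n2 - n21/n1|."""
    n1 = sum(labels[start:end])
    n2 = (end - start) - n1
    if n1 == 0 or n2 == 0:
        return None
    n11 = sum(labels[start:cut])
    n12 = (cut - start) - n11
    ks1 = abs(Fraction(n12, n2) - Fraction(n11, n1))
    ks2 = abs(Fraction(n2 - n12, n2) - Fraction(n1 - n11, n1))
    return ks1 + ks2


def best_split_with_ties(labels, same_as_next, start, end):
    """Pick the allowed cut with the largest KS sum, or None if no cut is possible."""
    scored = []
    for cut in allowed_cuts(same_as_next, start, end):
        score = ks_sum(labels, start, cut, end)
        if score is not None:
            scored.append((score, cut))
    if not scored:
        return None
    return max(scored, key=lambda item: item[0])[1]


labels = [0, 1, 0, 1, 1]                   # hypothetical labels in the update order
no_ties = [False, False, False, False]
ties = [False, False, True, False]         # positions 2 and 3 hold equal feature values
print(best_split_with_ties(labels, no_ties, 0, 5))
print(best_split_with_ties(labels, ties, 0, 5))
```

In this assumed data the unconstrained optimum lies between positions 2 and 3; once those positions are marked as tied, that cut is excluded and the best remaining candidate is chosen instead.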
The embodiment shown in steps 1c to 4c is only one implementation of step S560. In another embodiment, step 3c may be modified: for example, when the updated bins do not satisfy the preset binning condition, some of the updated bins are selected as initial bins, and the procedure returns to step 2c.
For the embodiments of fig. 5 and fig. 6, reference may also be made to the related description of the embodiment of fig. 2; the embodiments may be cross-referenced.
The foregoing describes certain embodiments of the present specification, and other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily have to be in the particular order shown or in sequential order to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Fig. 7 is a schematic block diagram of a supervised feature binning apparatus based on privacy protection according to an embodiment. The apparatus 700 is deployed in a feature holder, which may be any computer, cluster, or device with computing and processing capabilities. The feature holder stores feature values of a first feature of N samples, the original label values of the N samples are stored in a label holder, and the N samples are arranged in a predetermined order. This apparatus embodiment corresponds to the method embodiment shown in fig. 2. The apparatus 700 comprises:
an obtaining module 710 configured to obtain N first encrypted tag values arranged according to the predetermined sequence, where each first encrypted tag value is obtained by homomorphically encrypting a corresponding original tag value using a public key;
an association module 720, configured to associate the N first encrypted tag values with the N feature values of the first feature, respectively, based on the predetermined order, to obtain an association relationship;
a reordering module 730 configured to reorder the N feature values by value to obtain a first sequence composed of the N feature values arranged in an update order, and to obtain, based on the association relationship, a second sequence composed of N second encrypted tag values arranged in the update order;
a first sending module 740, configured to send at least the second sequence to the tag holder, so that the tag holder performs adjacent binning and merging operations based on at least the second sequence to obtain a first binning result;
a first receiving module 750 configured to receive the first binning result sent by the tag holder, wherein the first binning result shows first binning corresponding to each position in the updating sequence;
a first binning module 760 configured to bin the feature values at the positions in the first sequence according to the first binning result to obtain a feature binning result.
In one embodiment, the N original tag values are integers; the reordering module 730, when obtaining, based on the association relationship, the second sequence composed of the N second encrypted tag values arranged in the update order, is configured to perform:
generating a corresponding random number for any one of the N first encrypted tag values; multiplying the random number by a specified integer value to obtain a transformed random number; homomorphically encrypting the transformed random number into an encrypted random number using the public key; adding the encrypted random number and the first encrypted tag value in a homomorphic manner to obtain a second encrypted tag value; wherein the specified integer value is greater than a maximum of the N original tag values;
and determining a second sequence consisting of the N second encrypted tag values arranged according to the updating sequence based on the incidence relation.
In one embodiment, the first sending module 740 is configured to send the second sequence directly to the tag holder if there is no equal eigenvalue of the N eigenvalues.
In a specific embodiment, the first sending module 740 is configured to, in a case that there is an equal eigenvalue in the N eigenvalues, determine, based on the N eigenvalues in the first sequence, a position where the equal eigenvalue is located in the update order, and send the second sequence and the position where the equal eigenvalue is located in the update order to the tag holder.
In one embodiment, the first binning module 760 is specifically configured to:
map each position in the first binning result to the corresponding position in the first sequence, and determine the first bin of each position in the first binning result as the bin of the feature value at the corresponding position in the first sequence.
In one embodiment, the positions of the equal feature values in the update order are represented in one of the following ways:
preset separators are placed between positions in the update order to mark the positions of equal feature values;
each position in the update order is represented in a one-dimensional bitmap, and the positions of equal feature values are distinguished by a specified value distribution rule in the one-dimensional bitmap.
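The one-dimensional bitmap representation can be illustrated as follows. The specification does not fix the value distribution rule; this sketch assumes one possible rule, in which the bit toggles each time the feature value changes, so that consecutive positions carrying the same bit hold equal feature values. The function names are hypothetical.

```python
def tie_bitmap(first_sequence):
    """Encode the update order as bits; the bit toggles whenever the feature value changes."""
    bits = []
    current = 0
    for i, value in enumerate(first_sequence):
        if i > 0 and value != first_sequence[i - 1]:
            current ^= 1
        bits.append(current)
    return bits


def groups_from_bitmap(bits):
    """Tag holder side: recover (start, end) position groups that must stay in the same bin."""
    groups = []
    start = 0
    for i in range(1, len(bits) + 1):
        if i == len(bits) or bits[i] != bits[start]:
            groups.append((start, i - 1))
            start = i
    return groups


sorted_values = [1, 2, 2, 3, 3, 3, 4]   # hypothetical first sequence
bitmap = tie_bitmap(sorted_values)
groups = groups_from_bitmap(bitmap)
print(bitmap)
print(groups)
```

Under this assumed rule the bitmap carries exactly one bit per position and reveals only which neighbouring positions are tied, not the feature values themselves.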
Fig. 8 is a schematic block diagram of a supervised feature binning apparatus based on privacy protection according to an embodiment. The apparatus 800 is deployed in a label holder, which may be any computer, cluster, or device with computing and processing capabilities. The label holder stores original label values of N samples, the feature values of a first feature of the N samples are stored in a feature holder, and the N samples are arranged in a predetermined order. This apparatus embodiment corresponds to the method embodiment shown in fig. 2. The apparatus 800 comprises:
an encryption module 810 configured to homomorphically encrypt the N original tag values into corresponding first encrypted tag values using a public key, and send the N first encrypted tag values arranged in the predetermined order to the feature holder;
a second receiving module 820 configured to receive at least a second sequence transmitted by the feature holder; the second sequence is composed of N second encryption tag values arranged according to an updating sequence;
a decryption module 830, configured to decrypt, using a private key corresponding to the public key, the N second encrypted tag values in the second sequence into corresponding original tag values, to obtain N original tag values arranged according to an update sequence;
a second binning module 840 configured to perform adjacent binning merging operations based on at least the N original tag values arranged in the update order to obtain a first binning result, where the first binning result shows first binning corresponding to each position in the update order;
a second sending module 850 configured to send the first binned result to the feature holder.
In one embodiment, the N original tag values are integers; a decryption module 830, configured to decrypt, using a private key corresponding to the public key, the N second encrypted tag values in the second sequence into corresponding first values, and divide the N first values by a specified integer value respectively and then obtain a remainder to obtain corresponding original tag values; wherein the specified integer value is greater than a maximum of the N original tag values.
In one embodiment, the second binning module 840 is specifically configured to:
taking each position corresponding to the N original label values arranged according to the updating sequence as an initial sub-box to obtain N initial sub-boxes;
performing an adjacent-bin merging operation on the initial bins based on the original label values in each initial bin to obtain an updated binning result, where the updated binning result indicates the updated bin corresponding to each position in the update order;
when the updated bins do not satisfy a preset binning condition, taking the updated bins as initial bins and returning to the step of performing the adjacent-bin merging operation on the initial bins based on the original label values in each initial bin;
and when each updated binning meets the preset binning condition, determining the updated binning result as a first binning result.
In one embodiment, in addition to the second sequence sent by the feature holder, the positions of equal feature values in the update order sent by the feature holder are also received; the second binning module 840 is specifically configured to:
determining initial sub-boxes corresponding to the N original label values arranged according to the updating sequence based on the positions of the equal characteristic values in the updating sequence;
performing an adjacent-bin merging operation on the initial bins based on the original label values in each initial bin to obtain an updated binning result, where the updated binning result indicates the updated bin corresponding to each position in the update order;
when each updated sub-box does not meet the preset sub-box condition, taking the updated sub-box as an initial sub-box, returning to execute the step of performing adjacent sub-box merging operation on each initial sub-box based on the original label value in each initial sub-box;
and when each updated binning meets the preset binning condition, determining the updated binning result as a first binning result.
In a specific embodiment, the second binning module 840, when determining the initial bins corresponding to the N original label values arranged in the update order based on the positions of the equal feature values in the update order, is configured to perform:
based on the positions of the equal feature values in the update order, for the N original label values arranged in the update order, placing the original label values at positions of different feature values into different initial bins and placing the original label values at positions of equal feature values into the same initial bin.
In one embodiment, when the second binning module 840 performs the adjacent binning merging operation on each initial bin based on the original tag value in each initial bin, the method includes:
and sequentially determining the chi-square value of each pair of adjacent initial sub-boxes based on the original label value in each initial sub-box to obtain a plurality of chi-square values, and combining the pair of adjacent initial sub-boxes corresponding to the minimum chi-square value.
In one embodiment, the preset binning condition includes: the total number of updated bins reaches a preset number; or, when adjacent bins are merged by chi-square binning, the chi-square value of every pair of adjacent updated bins is greater than a preset threshold.
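The merging-based procedure performed by the second binning module 840 can be sketched as below. The chi-square value is assumed here to be the standard Pearson chi-square statistic of the 2x2 table formed by two adjacent bins and the two label classes; the definition given earlier in this specification takes precedence if it differs. Only the preset-number stopping condition is shown, and all names are hypothetical.

```python
from fractions import Fraction


def chi_square(bin_a, bin_b):
    """Pearson chi-square of the 2x2 table (bin x label) for 0/1 labels (assumed definition)."""
    rows = [(bin_a.count(0), bin_a.count(1)), (bin_b.count(0), bin_b.count(1))]
    total = sum(sum(row) for row in rows)
    cols = [rows[0][0] + rows[1][0], rows[0][1] + rows[1][1]]
    chi = Fraction(0)
    for row in rows:
        for j in (0, 1):
            expected = Fraction(sum(row) * cols[j], total)
            if expected:                       # skip cells whose expected count is zero
                chi += (row[j] - expected) ** 2 / expected
    return chi


def merge_binning(labels, preset_bins):
    """Start with one bin per position and repeatedly merge the adjacent pair with the smallest chi-square."""
    bins = [[y] for y in labels]               # N initial bins
    while len(bins) > preset_bins:
        scores = [chi_square(bins[i], bins[i + 1]) for i in range(len(bins) - 1)]
        i = scores.index(min(scores))          # leftmost pair wins on ties
        bins[i:i + 2] = [bins[i] + bins[i + 1]]
    return bins


labels = [0, 0, 1, 1]                          # hypothetical labels in the update order
merged = merge_binning(labels, preset_bins=2)
print(merged)
```

Adjacent bins with similar label distributions have a small chi-square value and are merged first, so the surviving bin boundaries are those where the label distribution changes most.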
Fig. 9 is a schematic block diagram of another supervised feature binning apparatus based on privacy protection according to an embodiment. The apparatus 900 is deployed in a tag holder, which can be a variety of computers, clusters, or devices with computing processing capabilities. The label holder stores original label values of N samples, the feature value of a first feature of the N samples is stored in the feature holder, and the N samples are arranged in a predetermined order. This embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 5. The apparatus 900 comprises:
an encryption module 910 configured to homomorphically encrypt the N original tag values into corresponding first encrypted tag values using a public key, and send the N first encrypted tag values arranged according to the predetermined order to the feature holder;
a second receiving module 920, configured to receive at least the second sequence sent by the feature holder, where the second sequence is composed of N second encrypted tag values arranged in an update order;
a decryption module 930 configured to decrypt, using a private key corresponding to the public key, the N second encrypted tag values in the second sequence into corresponding original tag values, to obtain N original tag values arranged in the update order;
a third binning module 940, configured to perform a split binning operation based on at least the N original tag values arranged in the update order to obtain a first binning result, where the first binning result indicates the first bin corresponding to each position in the update order;
a second sending module 950 configured to send the first binning result to the feature holder.
In one embodiment, the N original tag values are integers; the decryption module 930 is specifically configured to:
decrypt the N second encrypted tag values in the second sequence into corresponding first values by using a private key corresponding to the public key, and take the remainder of each of the N first values modulo a specified integer value to obtain the corresponding original tag values; wherein the specified integer value is greater than the maximum of the N original tag values.
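The arithmetic behind this recovery step can be illustrated with a toy additively homomorphic scheme. The sketch below uses a textbook Paillier construction with tiny fixed primes purely for illustration; it is not secure and is not taken from the patent, which does not name a specific scheme here. The helper names (`encrypt`, `decrypt`, `add`, `mask`, `unmask`) and the bound on the random number are hypothetical. The point is that a plaintext of the form r * M + y, with M greater than every tag value y, reduces to y modulo M.

```python
import math
import random

# Toy Paillier key with tiny primes: for illustration only, NOT secure.
P, Q = 1009, 1013
N = P * Q
N2 = N * N
PHI = (P - 1) * (Q - 1)
MU = pow(PHI, -1, N)  # modular inverse, requires Python 3.8+


def encrypt(m, rng):
    """Encrypt integer m (0 <= m < N) under the public key (N, g = N + 1)."""
    while True:
        r = rng.randrange(1, N)
        if math.gcd(r, N) == 1:
            break
    return (pow(N + 1, m, N2) * pow(r, N, N2)) % N2


def decrypt(c):
    """Decrypt with the private values PHI and MU."""
    return ((pow(c, PHI, N2) - 1) // N * MU) % N


def add(c1, c2):
    """Homomorphic addition: the product of ciphertexts encrypts the sum."""
    return (c1 * c2) % N2


def mask(first_ct, modulus, rng):
    """Feature-holder side: add an encrypted multiple of the modulus."""
    r = rng.randrange(1, 1000)  # hypothetical bound, keeps r * modulus + y < N
    return add(first_ct, encrypt(r * modulus, rng))


def unmask(second_ct, modulus):
    """Tag-holder side: decrypt to the first value, then take the remainder."""
    return decrypt(second_ct) % modulus
```

For binary tag values the specified integer value could be 2, so a tag value of 1 masked with r = 7 decrypts to 15 and 15 mod 2 gives back 1.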
In a specific embodiment, the third binning module 940 is specifically configured to:
take the N original tag values arranged in the update order as one initial bin;
for any initial bin, determine a split point of the initial bin based on the original tag values in the initial bin, and split the initial bin at the split point to obtain an updated binning result, where the updated binning result indicates the updated bin corresponding to each position in the update order;
when the updated bins do not meet the preset binning condition, take the updated bins as initial bins and return to the step of determining, for any initial bin, a split point of the initial bin based on the original tag values in the initial bin;
and when the updated bins meet the preset binning condition, determine the updated binning result as the first binning result.
In a specific embodiment, when determining, for any initial bin, the split point of the initial bin based on the original tag values in the initial bin, the third binning module 940 is configured to:
for any initial bin, take a point between each pair of adjacent original tag values in the initial bin as a candidate split point that divides the initial bin into two corresponding sub-bins, and determine, using the Best-KS algorithm and based on the original tag values in the two sub-bins, the KS sum value of the two sub-bins as the feature discrimination of that candidate split point; and select, from the plurality of candidate split points, the candidate split point with the maximum feature discrimination as the split point of the initial bin.
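A candidate-by-candidate search of this kind might look like the sketch below. It is an assumption-laden illustration: tag values are taken to be binary, and the discrimination score is the standard KS statistic at the cut (the absolute gap between the cumulative shares of 1s and 0s to the left of the cut), which may differ from the exact "KS sum value" the text refers to. The function name `best_ks_split` is hypothetical.

```python
def best_ks_split(labels):
    """Find the cut index i that best separates labels[:i] from labels[i:].

    labels are 0/1 tag values in the update order. Returns (i, ks) for the
    candidate with the largest KS value, or None if the bin holds only one
    class and therefore cannot be scored.
    """
    total_pos = sum(labels)
    total_neg = len(labels) - total_pos
    if total_pos == 0 or total_neg == 0:
        return None
    best = None
    pos = neg = 0
    for i in range(1, len(labels)):
        # Update cumulative counts with the value just left of the cut.
        if labels[i - 1] == 1:
            pos += 1
        else:
            neg += 1
        ks = abs(pos / total_pos - neg / total_neg)
        if best is None or ks > best[1]:
            best = (i, ks)
    return best


split = best_ks_split([0, 0, 0, 1, 0, 1, 1, 1])
```

In this example the first three values are all 0, so cutting after position 2 puts 75% of the 0s and none of the 1s on the left side.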
In one embodiment, in addition to the second sequence sent by the feature holder, the apparatus also receives the positions of equal feature values in the update order sent by the feature holder; the third binning module 940 is specifically configured to:
take the N original tag values arranged in the update order as one initial bin;
for any initial bin, determine, based on the positions of equal feature values in the update order and the original tag values in the initial bin, a split point in the initial bin that does not lie between positions of equal feature values, and split the initial bin at the split point to obtain an updated binning result, where the updated binning result indicates the updated bin corresponding to each position in the update order;
when the updated bins do not meet the preset binning condition, take the updated bins as initial bins and return to the step of determining, for any initial bin, a split point of the initial bin based on the original tag values in the initial bin;
and when the updated bins meet the preset binning condition, determine the updated binning result as the first binning result.
In a specific embodiment, when determining, for any initial bin, a split point in the initial bin that does not lie between positions of equal feature values, based on the positions of equal feature values in the update order and the original tag values in the initial bin, the third binning module 940 is configured to:
for any initial bin, based on the positions of equal feature values in the update order, take a point between each pair of adjacent original tag values in the initial bin, excluding points that lie between positions of equal feature values, as a candidate split point that divides the initial bin into two corresponding sub-bins, and determine, using the Best-KS algorithm and based on the original tag values in the two sub-bins, the KS sum value of the two sub-bins as the feature discrimination of that candidate split point; and select, from the plurality of candidate split points, the candidate split point with the maximum feature discrimination as the split point of the initial bin.
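The tie-aware variant, together with the outer loop that stops at a preset number of bins, can be sketched as follows. Several choices here are assumptions rather than details from the text: tag values are binary, the score is the standard KS statistic at the cut, tie positions are given as flags meaning "same feature value as the previous position", and each round splits only the bin whose best candidate scores highest. The names `best_split` and `split_bins` are hypothetical.

```python
def best_split(labels, tied_with_prev, start, end):
    """Best allowed cut inside labels[start:end], or None if there is none.

    A cut at index c separates positions c - 1 and c; it is skipped when
    tied_with_prev[c] is true, because it would separate equal feature values.
    """
    segment = labels[start:end]
    total_pos = sum(segment)
    total_neg = len(segment) - total_pos
    if total_pos == 0 or total_neg == 0:
        return None
    best = None
    pos = neg = 0
    for cut in range(start + 1, end):
        if labels[cut - 1] == 1:
            pos += 1
        else:
            neg += 1
        if tied_with_prev[cut]:
            continue
        ks = abs(pos / total_pos - neg / total_neg)
        if best is None or ks > best[1]:
            best = (cut, ks)
    return best


def split_bins(labels, tied_with_prev, max_bins):
    """Split repeatedly until max_bins bins exist or no allowed cut remains.

    Bins are half-open (start, end) ranges over positions in the update order.
    """
    bins = [(0, len(labels))]
    while len(bins) < max_bins:
        candidates = []
        for idx, (s, e) in enumerate(bins):
            found = best_split(labels, tied_with_prev, s, e)
            if found:
                candidates.append((found[1], idx, found[0]))
        if not candidates:
            break
        _, idx, cut = max(candidates)  # split the bin with the highest score
        s, e = bins[idx]
        bins[idx:idx + 1] = [(s, cut), (cut, e)]
    return bins


labels = [0, 0, 0, 1, 0, 1, 1, 1]
tied = [False, False, False, True, False, False, False, False]
two_bins = split_bins(labels, tied, max_bins=2)
three_bins = split_bins(labels, tied, max_bins=3)
```

Here positions 2 and 3 share a feature value, so the otherwise attractive cut between them is skipped and the search settles on the cut before position 5.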
In one embodiment, the preset binning condition includes: the total number of updated bins reaches a preset number.
The above apparatus embodiments correspond to the method embodiments and are derived from them, so they achieve the same technical effects; for details, refer to the description of the corresponding method embodiments, which is not repeated here.
Embodiments of the present specification also provide a computer-readable storage medium having a computer program stored thereon, which, when executed in a computer, causes the computer to perform the method of any one of fig. 1 to 6.
The present specification also provides a computing device, including a memory and a processor, where the memory stores executable code, and the processor executes the executable code to implement the method described in any one of fig. 1 to 6.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the storage medium and the computing device embodiments, since they are substantially similar to the method embodiments, they are described relatively simply, and reference may be made to some descriptions of the method embodiments for relevant points.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in connection with the embodiments of the invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The above-mentioned embodiments further describe the objects, technical solutions and advantages of the embodiments of the present invention in detail. It should be understood that the above description is only exemplary of the embodiments of the present invention, and is not intended to limit the scope of the present invention, and any modification, equivalent replacement, or improvement made on the basis of the technical solutions of the present invention should be included in the scope of the present invention.

Claims (25)

1. A supervised feature binning method based on privacy protection is executed by a feature holder, wherein the feature holder stores feature values of first features of N samples, original tag values of the N samples are stored in the tag holder, and the N samples are arranged according to a set sequence; the method comprises the following steps:
acquiring N first encrypted tag values which are sent by the tag holder and arranged according to the set sequence, wherein each first encrypted tag value is obtained by using a public key to homomorphically encrypt a corresponding original tag value;
on the basis of the established sequence, respectively associating the N first encryption tag values with N characteristic values of the first characteristic to obtain an association relation;
reordering the N characteristic values according to the value size to obtain a first sequence consisting of N characteristic values arranged according to an updating sequence, and processing to obtain a second sequence consisting of N second encryption tag values arranged according to the updating sequence based on the incidence relation;
at least sending the second sequence to the tag holder, so that the tag holder performs characteristic binning based on at least the second sequence to obtain a first binning result;
receiving the first binning result sent by the label holder, wherein the first binning result shows first binning corresponding to each position in the updating sequence;
and according to the first binning result, binning the characteristic values of all positions in the first sequence to obtain a characteristic binning result.
2. The method of claim 1, N original tag values being integers; the step of processing to obtain a second sequence of N second encrypted tag values arranged in the update order based on the association relationship includes:
generating a corresponding random number for any one of the N first encrypted tag values; multiplying the random number by a specified integer value to obtain a transformed random number; homomorphically encrypting the transformed random number into an encrypted random number using the public key; adding the encrypted random number and the first encrypted tag value in a homomorphic manner to obtain a second encrypted tag value; wherein the specified integer value is greater than a maximum of the N original tag values;
and determining a second sequence consisting of the N second encrypted tag values arranged according to the updating sequence based on the incidence relation.
3. The method of claim 1, wherein, when no equal feature values exist among the N feature values, the step of sending at least the second sequence to the tag holder comprises: directly sending the second sequence to the tag holder.
4. The method of claim 1, wherein, when equal feature values exist among the N feature values, the step of sending at least the second sequence to the tag holder comprises:
determining the positions of the equal feature values in the update order based on the N feature values in the first sequence, and sending the second sequence and the positions of the equal feature values in the update order to the tag holder.
5. The method of claim 1, wherein the step of binning the feature values for each position in the first sequence according to the first binning result comprises:
associating each position in the first binning result with the corresponding position in the first sequence, and determining the first bin of each position in the first binning result as the bin of the feature value at the corresponding position in the first sequence.
6. The method of claim 1, wherein the positions of the equal feature values in the update order are represented in one of the following ways:
preset separators are placed among the positions in the update order to mark the positions of equal feature values;
the positions in the update order are represented by a one-dimensional bitmap, in which the positions of equal feature values are distinguished by a specified rule for assigning bit values.
7. A supervised feature binning method based on privacy protection, executed by a tag holder, wherein the tag holder stores original tag values of N samples, the feature values of a first feature of the N samples are stored in a feature holder, and the N samples are arranged in a set order; the method comprises the following steps:
using a public key to homomorphically encrypt the N original label values into corresponding first encrypted label values, and sending the N first encrypted label values arranged according to the set sequence to the feature holder;
receiving a second sequence transmitted by at least the feature holder; the second sequence is composed of N second encryption tag values arranged according to an updating sequence;
decrypting the N second encrypted tag values in the second sequence into corresponding original tag values by using a private key corresponding to the public key to obtain N original tag values arranged according to the updating sequence;
performing adjacent binning and merging operation at least based on the N original label values arranged according to the updating sequence to obtain a first binning result, wherein the first binning result shows that each position in the updating sequence corresponds to the first binning;
sending the first binned result to the feature holder.
8. The method of claim 7, the N original tag values being integers; the step of decrypting the N second encrypted tag values in the second sequence into corresponding original tag values using a private key corresponding to the public key comprises:
decrypting the N second encrypted tag values in the second sequence into corresponding first values by using a private key corresponding to the public key, dividing the N first values by a specified integer value respectively, and then taking a remainder to obtain corresponding original tag values; wherein the specified integer value is greater than a maximum of the N original tag values.
9. The method of claim 7, wherein said step of performing a binning merge operation on adjacent bins based on at least the N original tag values arranged in the updated order to obtain a first bin result comprises:
taking each position corresponding to the N original label values arranged according to the updating sequence as an initial sub-box to obtain N initial sub-boxes;
performing adjacent binning merging operation on each initial binning based on the original label value in each initial binning to obtain an updated binning result, wherein the updated binning corresponding to each position in the updating sequence is shown;
when each updated sub-box does not meet the preset sub-box condition, taking the updated sub-box as an initial sub-box, returning to execute the step of performing adjacent sub-box merging operation on each initial sub-box based on the original label value in each initial sub-box;
and when each updated binning meets the preset binning condition, determining the updated binning result as a first binning result.
10. The method of claim 7, wherein the positions of equal feature values in the update sequence transmitted by the feature holder are received in addition to the second sequence transmitted by the feature holder; the step of performing adjacent binning merge operation based on at least the N original tag values arranged in the update order includes:
determining initial sub-boxes corresponding to the N original label values arranged according to the updating sequence based on the positions of the equal characteristic values in the updating sequence;
performing adjacent binning merging operation on each initial binning based on the original label value in each initial binning to obtain an updated binning result, wherein the updated binning corresponding to each position in the updating sequence is shown;
when each updated sub-box does not meet the preset sub-box condition, taking the updated sub-box as an initial sub-box, returning to execute the step of performing adjacent sub-box merging operation on each initial sub-box based on the original label value in each initial sub-box;
and when each updated binning meets the preset binning condition, determining the updated binning result as a first binning result.
11. The method of claim 10, wherein the step of determining the initial bins corresponding to the N original tag values arranged in the update order based on the positions of the equal feature values in the update order comprises:
and based on the positions of the equal characteristic values in the updating sequence, dividing the original label values at the positions of different characteristic values into different initial bins and dividing the original label values at the positions of the same characteristic values into the same initial bins for the N original label values arranged according to the updating sequence.
12. The method of claim 9 or 10, wherein the step of performing a neighbor bin merge operation on each initial bin based on the original label value in each initial bin comprises:
and sequentially determining the chi-square value of each pair of adjacent initial sub-boxes based on the original label value in each initial sub-box to obtain a plurality of chi-square values, and combining the pair of adjacent initial sub-boxes corresponding to the minimum chi-square value.
13. The method of claim 9 or 10, the preset binning conditions comprising: the total number of the plurality of updating sub-boxes reaches a preset number; or when the adjacent sub-boxes are combined in a chi-square sub-box mode, the chi-square value of any pair of updated sub-boxes in the plurality of updated sub-boxes is larger than the preset threshold value.
14. A supervised feature binning method based on privacy protection, executed by a tag holder, wherein the tag holder stores original tag values of N samples, the feature values of a first feature of the N samples are stored in a feature holder, and the N samples are arranged in a set order; the method comprises the following steps:
using a public key to homomorphically encrypt the N original label values into corresponding first encrypted label values, and sending the N first encrypted label values arranged according to the set sequence to the feature holder;
receiving a second sequence at least sent by the feature holder, wherein the second sequence is composed of N second encryption tag values arranged according to an updating sequence;
decrypting the N second encrypted tag values in the second sequence into corresponding original tag values by using a private key corresponding to the public key to obtain N original tag values arranged according to the updating sequence;
performing a split binning operation based on at least the N original tag values arranged in the update order to obtain a first binning result, wherein the first binning result indicates the first bin corresponding to each position in the update order;
sending the first binned result to the feature holder.
15. The method of claim 14, the N original tag values being integers; the step of decrypting the N second encrypted tag values in the second sequence into corresponding original tag values using a private key corresponding to the public key comprises:
decrypting the N second encrypted tag values in the second sequence into corresponding first values by using a private key corresponding to the public key, dividing the N first values by a specified integer value respectively, and then taking a remainder to obtain corresponding original tag values; wherein the specified integer value is greater than a maximum of the N original tag values.
16. The method of claim 14, wherein the step of performing a split binning operation based on at least the N original tag values arranged in the updated order to obtain a first binning result comprises:
taking the N original label values arranged according to the updating sequence as an initial box;
for any one initial bin, determining a splitting point of the initial bin based on an original label value in the initial bin, splitting and binning the initial bin at the splitting point to obtain an updated bin result, wherein the updated bin result shows updated bins corresponding to each position in the updating sequence;
when each updated sub-box does not meet the preset sub-box condition, taking the updated sub-box as an initial sub-box, returning to execute the step of determining a splitting point of the initial sub-box based on an original label value in the initial sub-box aiming at any one initial sub-box;
and when each updated binning meets the preset binning condition, determining the updated binning result as a first binning result.
17. The method of claim 16, wherein the step of determining, for any initial bin, a split point for the initial bin based on the original tag value in the initial bin comprises:
aiming at any one initial sub-box, respectively taking a point between each pair of adjacent original label values in the initial sub-box as a to-be-selected splitting point, dividing the initial sub-box into two corresponding sub-boxes, and determining KS sum values of the two sub-boxes by adopting a Best-KS algorithm based on the original label values in the two sub-boxes to be used as feature discrimination degrees of the corresponding to-be-selected splitting points; and selecting the splitting point to be selected corresponding to the maximum feature discrimination from the plurality of splitting points to be selected as the splitting point of the initial binning.
18. The method of claim 14, wherein the positions of equal feature values in the update sequence transmitted by the feature holder are received in addition to the second sequence transmitted by the feature holder; the step of performing splitting and binning operation at least based on the N original tag values arranged in the update sequence to obtain a first binning result includes:
taking the N original label values arranged according to the updating sequence as an initial box;
for any initial binning, determining a splitting point which is not located at a position between equal characteristic values in the initial binning based on the position of the equal characteristic value in the updating sequence and an original label value in the initial binning, and splitting and binning the initial binning by using the splitting point to obtain an updated binning result, wherein the updated binning result shows the updated binning corresponding to each position in the updating sequence;
when each updated sub-box does not meet the preset sub-box condition, taking the updated sub-box as the initial sub-box, returning to execute the step of determining a splitting point of the initial sub-box based on an original label value in the initial sub-box aiming at any one initial sub-box;
and when each updated binning meets the preset binning condition, determining the updated binning result as a first binning result.
19. The method of claim 18, wherein the step of determining, for any initial bin, a split point in the initial bin that does not lie between positions of equal feature values, based on the positions of equal feature values in the update order and the original tag values in the initial bin, comprises:
for any initial sub-box, based on the position of the equal characteristic value in the updating sequence, dividing the initial sub-box into two corresponding sub-boxes by taking a point between each pair of adjacent original label values in other positions except the position of the equal characteristic value in the initial sub-box as a to-be-selected splitting point, and based on the original label values in the two sub-boxes, determining KS sum values of the two sub-boxes by adopting a Best-KS algorithm to serve as the characteristic discrimination of the corresponding to-be-selected splitting point; and selecting the splitting point to be selected corresponding to the maximum feature discrimination from the plurality of splitting points to be selected as the splitting point of the initial binning.
20. The method of claim 16 or 18, the preset binning conditions comprising: the total number of the plurality of update bins reaches a preset number.
21. A supervised feature binning device based on privacy protection is deployed in a feature holder, wherein the feature holder stores feature values of a first feature of N samples, original tag values of the N samples are stored in a tag holder, and the N samples are arranged in a set order; the device comprises:
an obtaining module configured to obtain N first encrypted tag values arranged in the set order, where each first encrypted tag value is obtained by homomorphically encrypting a corresponding original tag value using a public key;
an association module configured to associate the N first encrypted tag values with the N feature values of the first feature, respectively, based on the set order, to obtain an association relationship;
the rearrangement module is configured to rearrange the N characteristic values according to the value sizes to obtain a first sequence formed by the N characteristic values arranged according to an updating sequence, and process to obtain a second sequence formed by the N second encryption tag values arranged according to the updating sequence based on the incidence relation;
a first sending module configured to send at least the second sequence to the tag holder, so that the tag holder performs feature binning based on at least the second sequence to obtain a first binning result;
a first receiving module, configured to receive the first binning result sent by the tag holder, where the first binning result shows first binning corresponding to each position in the update sequence;
and the first binning module is configured to bin the characteristic values of all positions in the first sequence according to the first binning result to obtain a characteristic binning result.
22. A supervised feature binning device based on privacy protection is deployed in a label holder, wherein the label holder stores original label values of N samples, the feature value of a first feature in the N samples is stored in the feature holder, and the N samples are arranged in a set order; the device comprises:
the encryption module is configured to homomorphically encrypt the N original tag values into corresponding first encrypted tag values by using a public key, and send the N first encrypted tag values arranged according to the set sequence to the feature holder;
a second receiving module configured to receive at least a second sequence transmitted by the feature holder; the second sequence is composed of N second encryption tag values arranged according to an updating sequence;
a decryption module configured to decrypt the N second encrypted tag values in the second sequence into corresponding original tag values using a private key corresponding to the public key, to obtain N original tag values arranged in the update order;
the second binning module is configured to perform adjacent binning merging operation at least based on the N original tag values arranged according to the update sequence to obtain a first binning result, where the first binning result shows first binning corresponding to each position in the update sequence;
a second sending module configured to send the first binned result to the feature holder.
23. A supervised feature binning device based on privacy protection is deployed in a label holder, wherein the label holder stores original label values of N samples, the feature value of a first feature in the N samples is stored in the feature holder, and the N samples are arranged in a set order; the device comprises:
the encryption module is configured to homomorphically encrypt the N original tag values into corresponding first encrypted tag values by using a public key, and send the N first encrypted tag values arranged according to the set sequence to the feature holder;
a second receiving module configured to receive a second sequence at least sent by the feature holder, where the second sequence is composed of N second encryption tag values arranged in an update order;
a decryption module configured to decrypt the N second encrypted tag values in the second sequence into corresponding original tag values using a private key corresponding to the public key, to obtain N original tag values arranged in the update order;
a third binning module configured to perform splitting binning operation at least based on the N original tag values arranged according to the update sequence to obtain a first binning result, where the first binning result shows first binning corresponding to each position in the update sequence;
a second sending module configured to send the first binned result to the feature holder.
24. A computer-readable storage medium, having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of any one of claims 1-20.
25. A computing device comprising a memory having executable code stored therein and a processor that, when executing the executable code, implements the method of any of claims 1-20.
CN202010502530.1A 2020-06-05 2020-06-05 Supervision characteristic box dividing method and device based on privacy protection Active CN111401572B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010502530.1A CN111401572B (en) 2020-06-05 2020-06-05 Supervision characteristic box dividing method and device based on privacy protection

Publications (2)

Publication Number Publication Date
CN111401572A CN111401572A (en) 2020-07-10
CN111401572B true CN111401572B (en) 2020-08-21

Family

ID=71431912

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010502530.1A Active CN111401572B (en) 2020-06-05 2020-06-05 Supervision characteristic box dividing method and device based on privacy protection

Country Status (1)

Country Link
CN (1) CN111401572B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111898765A (en) * 2020-07-29 2020-11-06 深圳前海微众银行股份有限公司 Feature binning method, device, equipment and readable storage medium
CN112597525B (en) * 2021-03-04 2021-05-28 支付宝(杭州)信息技术有限公司 Data processing method and device based on privacy protection and server
CN113362048B (en) * 2021-08-11 2021-11-30 腾讯科技(深圳)有限公司 Data label distribution determining method and device, computer equipment and storage medium
CN113449048B (en) * 2021-08-31 2021-11-09 腾讯科技(深圳)有限公司 Data label distribution determining method and device, computer equipment and storage medium
CN117459214B (en) * 2023-12-22 2024-02-23 北京天润基业科技发展股份有限公司 Feature verification method and system based on homomorphic encryption and electronic equipment

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11194462B2 (en) * 2011-08-03 2021-12-07 Avaya Inc. Exclusion of selected data from access by collaborators
CN108959187B (en) * 2018-04-09 2023-09-05 中国平安人寿保险股份有限公司 Variable box separation method and device, terminal equipment and storage medium
CN110990857B (en) * 2019-12-11 2021-04-06 支付宝(杭州)信息技术有限公司 Multi-party combined feature evaluation method and device for protecting privacy and safety

Also Published As

Publication number Publication date
CN111401572A (en) 2020-07-10

Similar Documents

Publication Publication Date Title
CN111539535B (en) Joint feature binning method and device based on privacy protection
CN111401572B (en) Supervision characteristic box dividing method and device based on privacy protection
CN111539009B (en) Supervised feature binning method and device for protecting private data
US10489604B2 (en) Searchable encryption processing system and searchable encryption processing method
Hu et al. Securing SIFT: Privacy-preserving outsourcing computation of feature extractions over encrypted image data
AU2021218110A1 (en) Learning from distributed data
US20170039487A1 (en) Support vector machine learning system and support vector machine learning method
CN113449048B (en) Data label distribution determining method and device, computer equipment and storage medium
CN115688167B (en) Method, device and system for inquiring trace and storage medium
CN106651976B (en) A kind of image encryption method based on cluster and chaos
David et al. A bounded-space near-optimal key enumeration algorithm for multi-subkey side-channel attacks
CN113362048B (en) Data label distribution determining method and device, computer equipment and storage medium
CN111143865B (en) User behavior analysis system and method for automatically generating label on ciphertext data
US11184163B2 (en) Value comparison server, value comparison encryption system, and value comparison method
Sharif et al. Classifying encryption algorithms using pattern recognition techniques
Al-Rubaie et al. Privacy-preserving PCA on horizontally-partitioned data
CN109359588A (en) The k nearest neighbor classification method of non-interactive type under a kind of new secret protection
US10831919B2 (en) Method for confidentially querying an encrypted database
Ligier et al. Privacy preserving data classification using inner-product functional encryption
Weissbart et al. Systematic side-channel analysis of curve25519 with machine learning
Ahmad et al. A secure network communication protocol based on text to barcode encryption algorithm
Pradeepthi et al. Machine learning approach for analysing encrypted data
US10650083B2 (en) Information processing device, information processing system, and information processing method to determine correlation of data
Sharif et al. Performance evaluation of classifiers used for identification of encryption algorithms
Revanna et al. A novel priority based document image encryption with mixed chaotic systems using machine learning approach

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40032967; Country of ref document: HK)