CN111539535B

CN111539535B - Joint feature binning method and device based on privacy protection

Info

Publication number: CN111539535B
Application number: CN202010502513.8A
Authority: CN
Inventors: 李漓春; 张文彬
Original assignee: Alipay Hangzhou Information Technology Co Ltd
Current assignee: Alipay Hangzhou Information Technology Co Ltd
Priority date: 2020-06-05
Filing date: 2020-06-05
Publication date: 2022-04-12
Anticipated expiration: 2040-06-05
Also published as: CN111539535A

Abstract

The embodiments of this specification provide a joint feature binning method and device based on privacy protection, in which the two parties each store their own private data. The tag holder sends N first encrypted tag values, obtained by homomorphic encryption, together with a generated range certificate to the feature holder. After the verification based on the range certificate passes, the feature holder associates the N first encrypted tag values with the N feature values, reorders the N feature values by value to obtain a first sequence of the N feature values and a second sequence of N second encrypted tag values, both arranged in the update order, and sends the second sequence to the tag holder. The tag holder decrypts the second encrypted tag values in the second sequence to obtain the original tag values in each initial bin, performs feature binning based on the original tag values to obtain a first binning result, and sends the first binning result to the feature holder. The feature holder then bins the N feature values according to the first binning result.

Description

Joint feature binning method and device based on privacy protection

Technical Field

One or more embodiments of the present disclosure relate to the field of data processing technologies, and in particular, to a joint feature binning method and apparatus based on privacy protection.

Background

Binning is a method of processing features in machine learning modeling. Binning a feature means grouping the (possibly large) set of values of that feature and treating each group as one class value, i.e., mapping the many values in the set to a few class values. For example, for the feature of age, suppose the age values of all samples form a discrete value set from 1 to 50; grouping this set may yield the following 3 bins: ages 1 to 15, ages 16 to 35, and ages 36 to 50. Binning discretizes continuous variables and reduces multi-state discrete variables to fewer states. Binned features can bring many benefits to model training: for example, the model can be iterated more quickly, its stability can be improved, and overfitting can be reduced.
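As a minimal sketch of the age example above, the following Python snippet maps each age to one of three bins. The function name assign_bin and the list UPPER_BOUNDS are illustrative assumptions, and the third bin is taken to start at 36 so that the bins do not overlap.

```python
from bisect import bisect_left

# Inclusive upper bounds of the three example bins: 1-15, 16-35, 36-50.
UPPER_BOUNDS = [15, 35, 50]

def assign_bin(age: int) -> int:
    """Return the 0-based index of the example bin that contains `age`."""
    if not 1 <= age <= UPPER_BOUNDS[-1]:
        raise ValueError(f"age {age} is outside the example range 1-50")
    # bisect_left returns the index of the first upper bound that is >= age.
    return bisect_left(UPPER_BOUNDS, age)

ages = [3, 15, 16, 35, 36, 50]
bins = [assign_bin(a) for a in ages]
```

Each of the many age values is thereby replaced by one of only three class values, which is the grouping effect described above.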

Binning methods include unsupervised binning and supervised binning. Unsupervised binning does not rely on sample labels to bin features, whereas supervised binning bins features in conjunction with sample labels.

In supervised binning, one application scenario is that the features and the labels (also referred to below as tags) of the samples are held by different owners, each of which requires privacy protection for its own data and will not output that data in plaintext. However, the parties still need supervised binning of the features, for purposes such as joint model training. Accordingly, an improved scheme is desired that enables supervised binning of features when the features and tags are distributed among different parties, while ensuring the privacy and security of the private data.

Disclosure of Invention

One or more embodiments of this specification describe a joint feature binning method and apparatus based on privacy protection, which implement supervised binning of features in a scenario where the features and tags are held by different parties, while ensuring the privacy and security of the private data. The specific technical solutions are as follows.

In a first aspect, a joint feature binning method based on privacy protection is provided, which is performed by a feature holder. The feature holder stores feature values of a first feature of N samples, the original tag values of the N samples are stored at a tag holder, the N original tag values take values within a specified range, and the N samples are arranged in a predetermined order. The method comprises the following steps:

acquiring N first encrypted tag values, arranged in the predetermined order, and a corresponding range certificate, both sent by the tag holder; wherein, as agreed between the parties, each first encrypted tag value is obtained by homomorphically encrypting the corresponding original tag value with a public key;

verifying, based on the range certificate, that the original tag values corresponding to the N first encrypted tag values are within the specified range;

when the verification passes, associating the N first encrypted tag values with the N feature values of the first feature, respectively, based on the predetermined order, to obtain an association relationship;

reordering the N feature values by value to obtain a first sequence consisting of the N feature values arranged in an update order, and obtaining, based on the association relationship, a second sequence consisting of N second encrypted tag values arranged in the update order;

sending at least the second sequence to the tag holder, so that the tag holder performs feature binning based on at least the second sequence to obtain a first binning result;

receiving the first binning result sent by the tag holder, wherein the first binning result indicates the first bin corresponding to each position in the update order;

and binning the feature values at the respective positions in the first sequence according to the first binning result, to obtain a feature binning result.

In a second aspect, an embodiment provides a joint feature binning method based on privacy protection, which is performed by a tag holder. The tag holder stores original tag values of N samples, the feature values of a first feature of the N samples are stored at a feature holder, the N original tag values take values within a specified range, and the N samples are arranged in a predetermined order. The method comprises the following steps:

homomorphically encrypting the N original tag values into corresponding first encrypted tag values using a public key, generating a range certificate based on the N first encrypted tag values, and sending the N first encrypted tag values, arranged in the predetermined order, and the range certificate to the feature holder;

receiving at least a second sequence, which the feature holder sends after its verification based on the range certificate passes; the second sequence consists of N second encrypted tag values arranged in an update order;

decrypting the N second encrypted tag values in the second sequence into corresponding original tag values using the private key corresponding to the public key, to obtain N original tag values arranged in the update order;

performing an adjacent-bin merging operation based at least on the N original tag values arranged in the update order, to obtain a first binning result, wherein the first binning result indicates the first bin corresponding to each position in the update order;

and sending the first binning result to the feature holder.

In a third aspect, an embodiment provides a joint feature binning method based on privacy protection, which is performed by a tag holder. The tag holder stores original tag values of N samples, the feature values of a first feature of the N samples are stored at a feature holder, the N original tag values take values within a specified range, and the N samples are arranged in a predetermined order. The method comprises the following steps:

homomorphically encrypting the N original tag values into corresponding first encrypted tag values using a public key, generating a range certificate based on the N first encrypted tag values, and sending the N first encrypted tag values, arranged in the predetermined order, and the range certificate to the feature holder;

receiving at least a second sequence, which the feature holder sends after its verification based on the range certificate passes; the second sequence consists of N second encrypted tag values arranged in an update order;

performing a split-based binning operation based at least on the N original tag values arranged in the update order, to obtain a first binning result, wherein the first binning result indicates the first bin corresponding to each position in the update order;

and sending the first binning result to the feature holder.

In a fourth aspect, an embodiment provides a joint feature binning device based on privacy protection, deployed at a feature holder. The feature holder stores feature values of a first feature of N samples, the original tag values of the N samples are stored at a tag holder, the N original tag values take values within a specified range, and the N samples are arranged in a predetermined order. The device comprises:

an acquisition module configured to acquire N first encrypted tag values, arranged in the predetermined order, and a corresponding range certificate, both sent by the tag holder; wherein, as agreed between the parties, each first encrypted tag value is obtained by homomorphically encrypting the corresponding original tag value with a public key;

a verification module configured to verify, based on the range certificate, that the original tag values corresponding to the N first encrypted tag values are within the specified range;

an association module configured to, when the verification by the verification module passes, associate the N first encrypted tag values with the N feature values of the first feature, respectively, based on the predetermined order, to obtain an association relationship;

a reordering module configured to reorder the N feature values by value to obtain a first sequence consisting of the N feature values arranged in an update order, and to obtain, based on the association relationship, a second sequence consisting of N second encrypted tag values arranged in the update order;

a first sending module configured to send at least the second sequence to the tag holder, so that the tag holder performs feature binning based on at least the second sequence to obtain a first binning result;

a first receiving module configured to receive the first binning result sent by the tag holder, wherein the first binning result indicates the first bin corresponding to each position in the update order;

and a first binning module configured to bin the feature values at the respective positions in the first sequence according to the first binning result, to obtain a feature binning result.

In a fifth aspect, an embodiment provides a joint feature binning device based on privacy protection, deployed at a tag holder. The tag holder stores original tag values of N samples, the feature values of a first feature of the N samples are stored at a feature holder, the N original tag values take values within a specified range, and the N samples are arranged in a predetermined order. The device comprises:

an encryption module configured to homomorphically encrypt the N original tag values into corresponding first encrypted tag values using a public key, generate a range certificate based on the N first encrypted tag values, and send the N first encrypted tag values, arranged in the predetermined order, and the range certificate to the feature holder;

a second receiving module configured to receive at least a second sequence, which the feature holder sends after its verification based on the range certificate passes; the second sequence consists of N second encrypted tag values arranged in an update order;

a decryption module configured to decrypt the N second encrypted tag values in the second sequence into corresponding original tag values using the private key corresponding to the public key, to obtain N original tag values arranged in the update order;

a second binning module configured to perform an adjacent-bin merging operation based at least on the N original tag values arranged in the update order, to obtain a first binning result, wherein the first binning result indicates the first bin corresponding to each position in the update order;

and a second sending module configured to send the first binning result to the feature holder.

In a sixth aspect, an embodiment provides a joint feature binning device based on privacy protection, deployed at a tag holder. The tag holder stores original tag values of N samples, the feature values of a first feature of the N samples are stored at a feature holder, the N original tag values take values within a specified range, and the N samples are arranged in a predetermined order. The device comprises:

a second receiving module configured to receive at least a second sequence sent by the feature holder, the second sequence being sent after the feature holder's verification based on the range certificate passes and consisting of N second encrypted tag values arranged in an update order;

a third binning module configured to perform a split-based binning operation based at least on the N original tag values arranged in the update order, to obtain a first binning result, wherein the first binning result indicates the first bin corresponding to each position in the update order.

In a seventh aspect, embodiments provide a computer-readable storage medium, on which a computer program is stored, which, when executed in a computer, causes the computer to perform the method of any one of the first to third aspects.

In an eighth aspect, an embodiment provides a computing device, including a memory and a processor, where the memory stores executable code, and the processor executes the executable code to implement the method of any one of the first to third aspects.

According to the method and device provided by the embodiments of this specification, the tag holder sends the first encrypted tag values and the range certificate to the feature holder; after the verification based on the range certificate passes, the feature holder associates the feature values with the encrypted tag values, reorders the feature values, and sends the correspondingly reordered encrypted tag values to the tag holder. The tag holder can then decrypt them to obtain the original tag values in the update order, perform further feature binning operations based on these values, and send the resulting first binning result to the feature holder. No plaintext data is sent during the entire interaction, supervised binning of the feature is achieved, and homomorphic encryption protects the privacy and security of the private data as far as possible.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.

FIG. 1 is a schematic diagram illustrating an implementation scenario of an embodiment disclosed herein;

FIG. 2 is a schematic diagram illustrating an exemplary interaction flow for binning between two parties;

FIG. 3 is a schematic diagram of the locations of equal feature values in a feature holder and the first bin numbers;

FIG. 4 is a flowchart illustrating an iterative implementation of step S260 in FIG. 2;

FIG. 5 is a schematic diagram illustrating an interaction flow for binning between two parties in accordance with another embodiment;

FIG. 6 is a schematic diagram of a split-based iterative binning process;

FIG. 7 is a schematic block diagram of a binning apparatus deployed at a feature holder provided by one embodiment;

FIG. 8 is a schematic block diagram of a binning apparatus deployed at a tag holder provided in one embodiment;

FIG. 9 is a schematic block diagram of a binning apparatus deployed at a tag holder according to another embodiment.

Detailed Description

The scheme provided by the specification is described below with reference to the accompanying drawings.

FIG. 1 is a schematic view of an implementation scenario of an embodiment disclosed in this specification. The feature holder 10 stores the feature values of a first feature of N samples, and the tag holder 20 stores the original tag values of the N samples, where the N samples are arranged in a predetermined order and N is a natural number. The N samples may be samples in a model test set or samples in a model training set, which is not limited in this specification.

The first feature may be any one of the features of the sample. For example, each sample may be one of the following business objects: a user, a commodity, a merchant, an event, and so on. When the sample is a commodity, its features may include price, sales volume, etc., and its labels may include categories such as high-sales, medium-sales, and low-sales commodities; when the sample is a user, its features may include age, income, consumption amount, etc., and its label may include a usage frequency value for a certain client. The feature values of the first feature may be discrete or continuous. The labels of the samples may be classification labels or non-classification labels, i.e., numerical labels.

In a risk control scenario, the characteristics of the sample may include user data. Users can be classified into risky users (abnormal users) and non-risky users (normal users), and the user data is private data that needs to be kept secret. In such a scenario, the label of the sample may be a classification label, and the dataset in which the sample is located may be used to train the risk control model.

The feature holder 10 stores the sample identifiers of the N samples and the corresponding feature values (i.e., feature data) of the first feature, which are represented by the placeholder xx in FIG. 1. The tag holder 20 stores the sample identifiers of the N samples and the corresponding tag values (i.e., tag data), which are represented by the placeholder yy in FIG. 1. The feature holder 10 and the tag holder 20 may share a sample ordering. That the N samples are arranged in a predetermined order means that the ordering of the samples at the feature holder 10 and the ordering of the tags at the tag holder 20 are such that the n-th sample of the feature holder 10 and the n-th tag of the tag holder 20 together form a complete tagged sample, where n is a value less than or equal to N. For example, the 1st sample of the feature holder 10 describes the features of user A (such as age, education level, and monthly consumption amount), and the 1st tag of the tag holder 20 describes whether user A is a high-risk user (e.g., "1" for a high-risk user and "0" for a non-high-risk user); together they constitute a tagged sample describing user A. As another example, the 16th sample of the feature holder 10 describes the features of user P (such as age, education level, and monthly consumption amount), and the 16th tag of the tag holder 20 describes whether user P is a high-risk user; together they constitute a tagged sample describing user P. Both the feature data and the tag data are private data.

When the features and the tags of the samples are held by different holders, each of which requires privacy protection for its own data and cannot output that data in plaintext, the embodiments of this specification provide a two-party joint binning method that achieves supervised binning of the features without revealing the private data of either holder. Referring to the interaction process shown in FIG. 1, the tag holder 20 homomorphically encrypts the original tag values to obtain first encrypted tag values, generates a range certificate for the first encrypted tag values, and sends the first encrypted tag values and the range certificate to the feature holder 10. After verifying the first encrypted tag values based on the range certificate, the feature holder 10 reorders the feature values, obtains second encrypted tag values in the update order according to the association between the feature values and the received first encrypted tag values, and sends the second encrypted tag values in the update order to the tag holder 20. The tag holder 20 performs further feature binning operations based on the received information to obtain a first binning result and sends it to the feature holder 10. The feature holder 10 bins the feature values in the update order based on the first binning result to obtain a feature binning result. Thus no plaintext data is transmitted during the entire interaction, supervised binning of the feature is achieved, and homomorphic encryption protects the privacy and security of the private data as far as possible.

In fig. 1, each bin (bin 1, bin 2, and bin 3) obtained in feature holder 10 may contain a different number of samples, and in general, the samples in the different bins do not overlap. The feature binning result is only an example, and a feature binning result in a particular application may contain a different number of bins and different samples in each bin.

When the feature holder holds multiple features of the samples, the binning method of the embodiments of this specification may be applied to each feature, with supervised binning of that feature achieved through interaction with the tag holder. When multiple features of the samples are distributed among different feature holders, the method may be carried out between the tag holder and whichever feature holder holds the feature to be binned.

The following describes an embodiment of the present specification in more detail with reference to a scene diagram shown in fig. 1.

FIG. 2 is a schematic flowchart of a privacy-protecting supervised feature binning method according to an embodiment, carried out through interaction between the feature holder 10 and the tag holder 20. The feature holder 10 stores the feature values of a first feature of the N samples, and the original tag values of the N samples are stored at the tag holder 20. The N original tag values take values within a specified range, which is known to both parties. The two parties may agree in advance that the tag values lie in a specified range [0, k-1], where k is an integer greater than 1; the tag values may of course also be preset to lie in [0, k] or another specified range. The specified range may also be a range of integers; for example, the N original tag values may take integer values within the integer range [0, k-1] or [0, k]. For example, when the classification label is binary, the N original tag values take the value 0 or 1, and the specified range may be [0, 1]. When the classification label has three classes, the N original tag values take the value 0, 1, or 2, and the specified range may be [0, 2]. At both parties, the N samples are arranged in a predetermined order, which may be ascending or descending order of the sample identifiers (IDs), or the order in which the sample identifiers appear in a designated dictionary. The method includes the following steps S210 to S280.

In step S210, the tag holder 20 homomorphically encrypts each of the N original tag values into a corresponding first encrypted tag value using the public key Key1 (for example, the original tag value yy is homomorphically encrypted into E(yy)), generates a range certificate based on the N first encrypted tag values, and sends the N first encrypted tag values, arranged in the predetermined order, and the range certificate to the feature holder 10. The feature holder 10 receives the N first encrypted tag values, arranged in the predetermined order, and the corresponding range certificate sent by the tag holder 20.

The range certificate is used to prove that the original tag values corresponding to the N first encrypted tag values are within the specified range. Encrypting each original tag value into a corresponding first encrypted tag value by homomorphic encryption is a data processing manner agreed upon by the tag holder 20 and the feature holder 10. When the tag holder 20 follows this agreement, it homomorphically encrypts the original tag values according to the specified homomorphic encryption algorithm and adds no extra marks to the plaintext underlying the first encrypted tag values. To prove that the N first encrypted tag values were indeed produced in the agreed manner, the tag holder 20 generates the range certificate based on the N first encrypted tag values.

Of course, a malicious party or an impostor might not encrypt in the agreed manner, and might instead send N carefully constructed first encrypted tag values to the feature holder 10 in order to harvest the private data held by the feature holder 10. If such a party has no range certificate, or only a forged one, it cannot extract the private data of the feature holder 10.

When the tag holder 20 sends the N first encrypted tag values and the corresponding range certificate to the feature holder 10, they may be sent together or in succession, or sent separately. The generation and transmission of the range certificate are described in more detail at the end of this embodiment.

The feature holder 10 may also send a data acquisition request to the tag holder 20; upon receiving the request, the tag holder 20 homomorphically encrypts the N original tag values into corresponding first encrypted tag values using the public key Key1, i.e., performs step S210.

In this step, the tag holder 20 may itself generate the public key Key1 and the corresponding private key Key2 for homomorphic encryption, or may obtain a pre-generated public key Key1 and private key Key2.

The N original tag values lie within the specified range, and each specific value may appear multiple times. For example, the original tag values may be drawn from {0, 1, 2, 3, 4}, and each of these values may appear multiple times among the N original tag values. However, because the tag holder 20 encrypts the N original tag values one by one using the public key, all of the resulting first encrypted tag values can be made different from one another, so the feature holder cannot infer the private data of the tag holder from the N first encrypted tag values arranged in the predetermined order. This encryption therefore keeps the tag values of the tag holder confidential and, as far as possible, discloses no private data. The implementation of homomorphic encryption is described in detail at the end of this embodiment.
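To illustrate why equal tag values need not produce equal ciphertexts, the following is a toy sketch of an additively homomorphic, randomized scheme. It assumes the Paillier cryptosystem, which this passage does not name; the function names, the fixed small primes, and the example labels are likewise assumptions, and the key size is far too small for real use.

```python
import math
import secrets

def keygen(p: int, q: int):
    """Derive a toy Paillier key pair from two distinct primes (g = n + 1)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse of lam mod n (Python 3.8+)
    return n, (lam, mu)   # public key n, private key (lam, mu)

def encrypt(n: int, m: int) -> int:
    """Encrypt m under public key n using fresh randomness r."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # r in [1, n-1]
        if math.gcd(r, n) == 1:
            break
    # (1 + n)^m mod n^2 equals 1 + m*n, so g^m needs no exponentiation.
    return (1 + m * n) * pow(r, n, n2) % n2

def decrypt(n: int, private_key, c: int) -> int:
    """Recover the plaintext from ciphertext c."""
    lam, mu = private_key
    u = pow(c, lam, n * n)
    return (u - 1) // n * mu % n

def add(n: int, c1: int, c2: int) -> int:
    """Homomorphic addition: multiplying ciphertexts adds the plaintexts."""
    return c1 * c2 % (n * n)

# Demo primes (Mersenne primes 2^31-1 and 2^61-1): insecure, illustration only.
PUBLIC_N, PRIVATE_KEY = keygen((1 << 31) - 1, (1 << 61) - 1)

labels = [0, 1, 1, 0, 1]                       # hypothetical binary tag values
ciphertexts = [encrypt(PUBLIC_N, y) for y in labels]
decrypted = [decrypt(PUBLIC_N, PRIVATE_KEY, c) for c in ciphertexts]

total = ciphertexts[0]
for c in ciphertexts[1:]:
    total = add(PUBLIC_N, total, c)
label_sum = decrypt(PUBLIC_N, PRIVATE_KEY, total)
```

Because each encryption draws a new random r, the two ciphertexts of the label 1 at positions 1 and 2 differ, so an observer of the ciphertext list cannot tell which samples share a tag value.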

In step S220, the feature holder 10 verifies, based on the range certificate, that the original tag values corresponding to the N first encrypted tag values are within the specified range; when the verification passes, it associates the N first encrypted tag values with the N feature values of the first feature, respectively, based on the predetermined order, to obtain an association relationship.

Specifically, when verifying, based on the range certificate, that the original tag values corresponding to the N first encrypted tag values are within the specified range, the feature holder 10 may perform computation on the range certificate and determine from the result whether the original tag value corresponding to each first encrypted tag value is within the specified range; if so, the verification passes. When the verification passes, the N first encrypted tag values are deemed to have been encrypted in the agreed manner, the environment is considered safe, and the subsequent steps can continue; when the verification fails, the environment is considered unsafe, and execution of the subsequent steps is suspended or refused. For details of this verification, see the later part of this embodiment.
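The control flow of this check can be sketched as below. The sketch does not implement a range certificate: the verify argument stands in for whatever verification procedure the parties have agreed on, and stub_verify, RangeVerificationError, accept_encrypted_tags, and the string values are hypothetical placeholders used only to exercise the pass and fail branches.

```python
from typing import Callable, List, Sequence

class RangeVerificationError(Exception):
    """Raised when the encrypted tag values must not be processed further."""

def accept_encrypted_tags(
    encrypted_tags: Sequence[str],
    certificate: object,
    n_expected: int,
    verify: Callable[[Sequence[str], object], bool],
) -> List[str]:
    """Return the tags for further processing only if verification passes."""
    if len(encrypted_tags) != n_expected:
        raise RangeVerificationError("unexpected number of encrypted tag values")
    if not verify(encrypted_tags, certificate):
        # Verification failed: refuse to continue with the later steps.
        raise RangeVerificationError("range certificate did not verify")
    return list(encrypted_tags)

def stub_verify(encrypted_tags: Sequence[str], certificate: object) -> bool:
    """Placeholder only; a real verifier would check a cryptographic proof."""
    return certificate == "valid-certificate"

tags = ["E(y1)", "E(y2)", "E(y3)"]
accepted = accept_encrypted_tags(tags, "valid-certificate", 3, stub_verify)

try:
    accept_encrypted_tags(tags, "forged-certificate", 3, stub_verify)
    rejected = False
except RangeVerificationError:
    rejected = True
```

Raising an exception on failure ensures that association and reordering cannot run on ciphertexts that were not verified.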

Passing the verification shows that the N first encrypted tag values were obtained by encryption in the agreed manner. This can carry several layers of meaning, for example that the tag holder correctly homomorphically encrypted the original tag values using the specified homomorphic encryption algorithm.

The feature holder 10 stores in advance the N feature values arranged in the predetermined order, and can directly associate each of them with the first encrypted tag value at the same position. For example, Table 1 shows the association between the N feature values of the first feature and the corresponding first encrypted tag values at the feature holder 10.

TABLE 1

Sample ID                    1        2  3  4  5  6  …        N
Feature value                xx                …              xx
First encrypted tag value    E(yy)             …              E(yy)

In Table 1, the individual feature values may differ, the individual original tag values may differ, and the first encrypted tag values are all different from one another.

Associating the feature values with the first encrypted tag values yields the correspondence among sample ID, feature value, and first encrypted tag value.
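A minimal sketch of this association, assuming both lists are already in the predetermined order, is given below. The Row type, the associate function, and the sample data are hypothetical, and the ciphertexts are shown as opaque strings.

```python
from typing import List, NamedTuple, Sequence

class Row(NamedTuple):
    sample_id: int
    feature_value: float
    encrypted_tag: str  # opaque ciphertext; the feature holder cannot read it

def associate(
    sample_ids: Sequence[int],
    feature_values: Sequence[float],
    encrypted_tags: Sequence[str],
) -> List[Row]:
    """Pair the n-th feature value with the n-th first encrypted tag value."""
    if not (len(sample_ids) == len(feature_values) == len(encrypted_tags)):
        raise ValueError("all inputs must contain exactly N entries")
    return [Row(*t) for t in zip(sample_ids, feature_values, encrypted_tags)]

rows = associate(
    [1, 2, 3],
    [35.0, 23.0, 41.0],
    ["E(y1)", "E(y2)", "E(y3)"],
)
```

The association relies purely on position, which is why both parties must use the same predetermined order.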

In step S230, the feature holder 10 reorders the N feature values according to the value sizes to obtain a first sequence of N feature values arranged in the update order, and processes the first sequence of N second encrypted tag values arranged in the update order based on the association relationship to obtain a second sequence of N second encrypted tag values arranged in the update order. In step S240, the feature holder 10 at least transmits the second sequence to the tag holder 20, and the tag holder 20 receives the second sequence transmitted by the feature holder 10.

In one embodiment, the N second encrypted tag values arranged in the update order may be directly equal to the N first encrypted tag values arranged in the update order, that is, each second encrypted tag value is equal to the corresponding first encrypted tag value. In another embodiment, each first encrypted tag value may be further processed to obtain a corresponding second encrypted tag value, and the N second encrypted tag values arranged in the update order are then obtained. In such an embodiment, the second encrypted tag value is not equal to the corresponding first encrypted tag value. The specific implementation of this embodiment is described in detail in other embodiments below. The step of processing the first encrypted tag value to obtain the corresponding second encrypted tag value may be performed before, after, or simultaneously with the reordering of the N feature values.

The first feature may be of a continuous type or a discrete type. Discrete features here are those whose values have an order relationship, such as age and height. Continuous features include, for example, revenue and number of transactions.

Sorting by value may be in descending or in ascending order; the specific sorting mode can be chosen at implementation time.

Because a correspondence exists among the sample ID, the feature value, and the first encrypted tag value, reordering the N feature values by value yields both the reordered feature value sequence and the reordered encrypted tag value sequence, that is, the first sequence of N feature values arranged in the update order and the second sequence of N second encrypted tag values arranged in the update order.

For example, referring to Table 1, the sample ID, the feature value, and the first encrypted tag value in each column form an association relationship. After the N feature values in Table 1 are reordered, the columns of Table 1 change positions, giving the N feature values arranged in the update order shown in Table 2, together with the corresponding first encrypted tag values, second encrypted tag values, and original tag values. Each column corresponds to one position in the update order. The feature holder 10 holds the positions in the update order, the sample IDs, the feature values, the first encrypted tag values, and the second encrypted tag values, but does not hold the original tag values. The original tag values listed in the last row of Table 2 merely illustrate the correspondence between the original tag values and the first and second encrypted tag values.

TABLE 2

Position after update order	Position 1	Position 2	Position 3	Position 4	…	Position N
Sample ID	5	22	3	55	…	14
Feature value	xx	…	xx
First encrypted tag value	E(yy)	…	E(yy)
Second encrypted tag value	E′(yy)
Original tag value	yy

The feature holder 10 sends a second sequence to the tag holder 20, which in effect carries the updated sequence of positions.
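The association and reordering described for Tables 1 and 2 can be sketched as follows. This is a minimal illustration only: the encrypted tag values are treated as opaque strings, and the sample data and the helper name `reorder_by_feature` are hypothetical rather than part of the described method.

```python
from typing import List, Tuple


def reorder_by_feature(
    sample_ids: List[int],
    feature_values: List[float],
    enc_tags: List[str],
) -> Tuple[List[int], List[float], List[str]]:
    """Sort the associated (ID, feature value, encrypted tag) columns by feature value."""
    # Associate the three rows column by column, as in Table 1.
    rows = list(zip(sample_ids, feature_values, enc_tags))
    # Ascending order is chosen here; Python's sort is stable, so equal
    # feature values keep their original relative order.
    rows.sort(key=lambda row: row[1])
    ids, first_sequence, second_sequence = (list(col) for col in zip(*rows))
    return ids, first_sequence, second_sequence


# Hypothetical data: four samples, two of which share the same feature value.
ids, first_seq, second_seq = reorder_by_feature(
    [1, 2, 3, 4],
    [5200.0, 3100.0, 8800.0, 3100.0],
    ["E(y1)", "E(y2)", "E(y3)", "E(y4)"],
)
```

Only `second_seq` would need to leave the feature holder in this sketch; the sample IDs and feature values stay local.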

The feature holder 10 may further determine whether equal feature values exist among the N feature values. If not, it directly sends the second sequence to the tag holder 20. If equal feature values exist, the positions of the equal feature values in the update order may be determined based on the N feature values in the first sequence, and the second sequence together with those positions may be sent to the tag holder 20. When feature values are binned, equal feature values should be placed in the same bin and not in different bins. Thus, when equal feature values exist among the N feature values, the feature holder 10 also transmits the positions of the equal feature values in the update order to the tag holder 20.

Among the N feature values in the first sequence, there may be multiple groups of equal feature values; for example, the fifth and sixth feature values are equal, the ninth and tenth feature values are equal, and so on.

Referring to the schematic diagram of feature binning shown in fig. 3, the first feature is revenue. The first two rows of fig. 3 show the feature values after reordering from small to large. There are 6 groups of equal feature values among these reordered feature values and, overall, the reordered feature values can be divided into 7 groups; the number of each group and the positions of the equal feature values it contains are shown in fig. 3.

Since the positions in the first sequence and in the second sequence correspond to one another, the positions of equal feature values in the update order also correspond to positions in the second sequence.

The positions of the equal feature values in the update sequence can be represented in any one of the following ways:

preset spacers are placed between positions in the update order to mark the positions of equal feature values;

or, the positions in the update order are represented by a one-dimensional bitmap, and the positions of equal feature values are distinguished by a specified value distribution rule in the one-dimensional bitmap. The one-dimensional bitmap may contain N bits, one for each position.

The update order of the first sequence or the second sequence contains N positions in total. Each position may be represented by some other information, for example a character such as 0, 1, or a, with a preset spacer added between different feature values; when there is no preset spacer between adjacent positions, the feature values corresponding to those adjacent positions are equal.

In connection with the example shown in fig. 3, the positions of the equal feature values in the update order can be expressed as: 00-0-0000-00-000-00-00, where each 0 represents a position and - is the preset spacer. They can also be expressed as: 01, 02-03-04, 05, 06, 07-08, 09-10, 11, 12-13, 14-15, 16, where each position is represented by a consecutive two-digit sequence number and the positions of different feature values are separated by the spacer.

When a bitmap representation is used, in connection with the example shown in fig. 3, the positions in the update order may be represented by the following one-dimensional bitmap: 0010000110001100, where each digit represents a position, adjacent identical digits represent positions of equal feature values, and adjacent different digits represent positions of different feature values. The positions in the update order may also be represented by the following one-dimensional bitmap: 0011000101001010, where, reading from left to right, a 1 marks each position at which the next different feature value begins, that is, a jump between different feature values.
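The three representations above can be derived mechanically from the sorted feature values. The following is a minimal sketch; the function names and the sample revenue values are hypothetical and were chosen only so that the tie structure matches the 16-position example.

```python
from itertools import groupby
from typing import List, Sequence


def group_sizes(sorted_values: Sequence[float]) -> List[int]:
    """Sizes of the runs of equal values in an already sorted sequence."""
    return [len(list(run)) for _, run in groupby(sorted_values)]


def spacer_string(sizes: List[int], spacer: str = "-") -> str:
    """One '0' per position, with a spacer between different feature values."""
    return spacer.join("0" * size for size in sizes)


def toggle_bitmap(sizes: List[int]) -> str:
    """Adjacent identical digits mark equal values; the digit flips at each new value."""
    return "".join(str(i % 2) * size for i, size in enumerate(sizes))


def jump_bitmap(sizes: List[int]) -> str:
    """A 1 marks the first position of every new feature value after the first group."""
    return "".join(("0" if i == 0 else "1") + "0" * (size - 1)
                   for i, size in enumerate(sizes))


# Hypothetical sorted revenues: 7 groups, 6 of which contain equal values.
values = [10, 10, 20, 30, 30, 30, 30, 40, 40, 50, 50, 50, 60, 60, 70, 70]
sizes = group_sizes(values)
```

Each encoding carries only the group boundaries, never the feature values themselves.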

Whichever of the above ways is used to represent the positions in the update order, it does not reveal any feature data stored by the feature holder or any distribution pattern of the feature data.

In step S250, the tag holder 20 decrypts the N second encrypted tag values in the second sequence into the corresponding original tag values using the private Key2 corresponding to the public Key1, so as to obtain N original tag values arranged in the update order. Decryption applies the private Key2 with the decryption algorithm that corresponds to the algorithm used for the above homomorphic encryption. When the second encrypted tag value is equal to the corresponding first encrypted tag value, the private Key2 may be used to directly decrypt the second encrypted tag value to obtain the corresponding original tag value.

In step S260, the tag holder 20 performs a binning merging operation on the N original tag values arranged in the update sequence to obtain a first binning result, where the first binning result shows the first binning corresponding to each position in the update sequence. In step S270, the tag holder 20 transmits the first binning result to the feature holder 10, and the feature holder 10 receives the first binning result transmitted from the tag holder 20.

When the tag holder 20 directly receives the second sequence transmitted by the feature holder 10, it is considered that the values of the N feature values in the feature holder 10 are different from each other. In this case, the tag holder 20 may directly determine the N original tag values arranged in the update order as N initial bins, and perform an adjacent bin merging operation on the N initial bins to obtain a first bin result.

When the tag holder 20 receives, in addition to the second sequence transmitted by the feature holder 10, the positions of the equal feature values in the update order, it is considered that the N feature values in the feature holder 10 include equal feature values. In this case, the tag holder 20 may determine the initial bins corresponding to the N original tag values arranged in the update order according to the positions of the equal feature values in the update order, and perform the adjacent bin merging operation on the initial bins to obtain the first binning result.
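How the tag holder might turn the received positions into initial bins can be sketched as follows. The sketch assumes the positions arrive as a one-dimensional bitmap in which a 1 marks the start of each new feature value; the function name `initial_bins` and the sample tag values are hypothetical.

```python
from typing import List


def initial_bins(labels: List[int], jump_bits: str) -> List[List[int]]:
    """Group decrypted tag values (in update order) into initial bins.

    A '1' in jump_bits opens a new bin; positions marked '0' join the
    current bin, so equal feature values always share one bin.
    """
    if len(labels) != len(jump_bits):
        raise ValueError("one bitmap digit is required per tag value")
    bins: List[List[int]] = []
    for label, bit in zip(labels, jump_bits):
        if bit == "1" or not bins:
            bins.append([])
        bins[-1].append(label)
    return bins


# Hypothetical decrypted tag values for 7 positions forming 3 groups.
bins = initial_bins([0, 1, 1, 0, 0, 1, 0], "0011000")
```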

The first binning result shows first binning corresponding to each position in the update sequence, and it can be understood that the first binning result shows first binning corresponding to a position where each feature value is arranged according to the update sequence, and also shows first binning corresponding to a position where each original tag value is arranged according to the update sequence. The positions of the first sequence, the second sequence and the N original label values arranged according to the updating sequence are mutually corresponding.

When the tag holder 20 performs the adjacent binning and merging operation based on the N original tag values arranged in the update sequence, the adjacent binning and merging operation may be performed in a chi-square binning manner, or other binning methods based on merging may be used.

Chi-square binning is a bottom-up (i.e., merge-based) method for discretizing a variable. Relying on the chi-square test in statistics, it repeatedly merges the pair of adjacent intervals with the smallest chi-square value until a certain stopping condition is met.

The first binning result may be represented in any of the following ways:

preset spacers are placed between positions in the update order to separate adjacent, different first bins;

or, the positions in the update order are represented by a one-dimensional bitmap, and the different first bins corresponding to the positions are distinguished by a specified value distribution rule in the one-dimensional bitmap. The one-dimensional bitmap may contain N bits, one for each position.

For a detailed description of the above two modes, reference may be made to the description of step S230 in conjunction with fig. 3, which is not repeated here. See the first binning result given in the last row of fig. 3. Sending the first binning result in these forms reduces the amount of data sent and improves data transmission efficiency. Besides representing the first binning result in the manner described above, the tag holder 20 may also send the N second encrypted tag values arranged in the update order, together with the corresponding first bins, to the feature holder 10.

Whichever way is used to express the first binning result, it does not reveal any tag data stored by the tag holder or any distribution pattern of the tag data.

In step S280, the feature holder 10 performs binning on the feature values at each position in the first sequence according to the first binning result to obtain a feature binning result. The feature holder 10 may associate each position in the first binning result with each position in the first sequence, and determine the first bin of each position in the first binning result as the bin of the feature value of the corresponding position in the first sequence. In this way, it is possible to determine to which bin the first feature of each sample in the feature holder 10 belongs.
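A minimal sketch of step S280 follows, assuming the first binning result arrives in the spacer form (one character per position, with a spacer between adjacent first bins). The function name `apply_binning` and the sample IDs are hypothetical.

```python
from typing import Dict, List


def apply_binning(result: str, sample_ids: List[int], spacer: str = "-") -> Dict[int, int]:
    """Map each sample ID (listed in update order) to the index of its first bin."""
    bin_index_by_position: List[int] = []
    for index, segment in enumerate(result.split(spacer)):
        # Every position inside one segment belongs to the same first bin.
        bin_index_by_position.extend([index] * len(segment))
    if len(bin_index_by_position) != len(sample_ids):
        raise ValueError("binning result and first sequence differ in length")
    return dict(zip(sample_ids, bin_index_by_position))


# Hypothetical first binning result over six positions and three bins.
assignment = apply_binning("00-000-0", [5, 22, 3, 55, 8, 14])
```

Because positions in the first binning result and in the first sequence coincide, no tag value is needed to complete this step.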

As can be seen from the above, in this embodiment the tag holder homomorphically encrypts the original tag values, generates the range certificate, and sends the encrypted tag values and the range certificate to the feature holder; when the verification based on the range certificate passes, the feature holder associates the feature values with the encrypted tag values, reorders the feature values, and sends the reordered encrypted tag values to the tag holder. In this way, the tag holder can obtain the reordered original tag values by decryption, perform the further feature binning operation based on them, and send the resulting first binning result to the feature holder. No plaintext data is sent during the entire interaction, supervised feature binning is achieved, and homomorphic encryption protects the privacy and security of the private data as far as possible.

In another embodiment of the present specification, in step S210, when the tag holder 20 generates the range certificate based on the N first encrypted tag values, the range certificate may be generated using the Bulletproofs algorithm based on the N first encrypted tag values. Bulletproofs is an efficient zero-knowledge range proof framework; the range proof method it provides can be used to prove that the original data corresponding to encrypted data lies within a certain range.

In this embodiment, the range proof may be utilized to verify whether the original tag values corresponding to the N first encrypted tag values are within the specified range. Further, it may be verified whether the original tag value corresponding to the N first encrypted tag values is within a specified integer range, for example, whether the original tag value is within an integer range represented by [0, k-1 ]. For example, when k takes 5, the integer range represented by [0, k-1] is an integer range of 0 to 4.

The following describes, in conjunction with the Bulletproofs algorithm, the parts of step S210 and step S220 that relate to generating and verifying the range certificate.

In one embodiment, when the tag holder 20 generates a range certificate based on the N first encrypted tag values, one range certificate may be generated for the entire N first encrypted tag values. Based on the range certificate, the feature holder 10 verifies whether N original tag values respectively corresponding to the N first encrypted tag values are all within a specified range.

In another embodiment, the tag holder 20 may divide the N first encrypted tag values into m batches and generate one range certificate for each batch of first encrypted tag values, thereby obtaining m range certificates. On receiving the m range certificates, the feature holder 10 may, according to a preset correspondence rule, verify the first encrypted tag values of each batch based on the corresponding range certificate. For example, the preset correspondence rule may be that the N first encrypted tag values arranged in the predetermined order are evenly divided, in order, into m groups, and the range certificates are likewise sent in m batches. Here m is an integer in [1, N]; when m is 1, the corresponding scheme is that a range certificate is generated for each first encrypted tag value.

In one embodiment, after obtaining the N first encrypted tag values and the corresponding range certificates, the tag holder 20 may send them to the feature holder 10 simultaneously or consecutively.

In another embodiment, the tag holder 20 may first send the N first encrypted tag values to the feature holder 10, exchange other data (e.g., a challenge number) with the feature holder 10, generate a range certificate based on the N first encrypted tag values and the exchanged data, and send the range certificate to the feature holder 10. There may be more than one challenge number. A challenge number may be generated by either the feature holder 10 or the tag holder 20, and may be randomly generated.

The following description takes as an example the case where the tag holder 20 generates a range certificate for m first encrypted tag values. The tag holder may generate the range certificate based on the m first encrypted tag values and the corresponding original tag values, using the private Key2 and the public Key1 of the homomorphic encryption and the specified range. In this process, the homomorphic encryption algorithm may be used to homomorphically encrypt the relevant values. In addition, the challenge number may also be used in this process.

When verifying whether the value of the original tag value corresponding to the m first encrypted tag values is within the specified range based on the range certification, the feature holder 10 constructs a first result and a second result based on the m first encrypted tag values and the specified range and by using homomorphic encryption and a public Key1, and verifies whether the first result is equal to the second result, and when the first result is equal to the second result, the verification is considered to pass, and when the first result is not equal to the second result, the verification fails.

Both the process of generating the range proof and the verification process based on the range proof described above follow the Bulletproofs algorithm. On this basis, the implementation may be adapted accordingly to obtain different implementations.

The homomorphic encryption mentioned in the above embodiments refers to an encryption algorithm in which operating on plaintexts and then encrypting the result is equivalent to encrypting the plaintexts and then performing a corresponding operation on the ciphertexts. For example, encrypting $a$ and $b$ with the same public Key1 yields $E(a)$ and $E(b)$; if $E(a+b) = E(a) \oplus E(b)$ is satisfied, the encryption algorithm is considered to satisfy additive homomorphism, where $\oplus$ is the corresponding homomorphic addition operation. In practice, the $\oplus$ operation may correspond to conventional addition, multiplication, and so on. For example, in the Paillier algorithm, $\oplus$ corresponds to conventional multiplication.

When the tag holder 20 homomorphically encrypts each original tag value, it may also randomly generate an encrypted random number r and perform the homomorphic encryption operation on the original tag value using the public Key1 and the encrypted random number r to obtain a first encrypted tag value. For example, in the Paillier algorithm, a formula of the form $C = (1 + \mathrm{Key1})^{m} \cdot r^{\mathrm{Key1}} \bmod \mathrm{Key1}^{2}$ can be employed to encrypt the original tag value m and obtain the first encrypted tag value C. Here m is the plaintext and C is the ciphertext; Key1 is the public key, generally the product of two very large prime numbers; r is the encrypted random number; and mod is the remainder function. Because a different encrypted random number r is generated each time, a different encrypted tag value is obtained each time an original tag value is homomorphically encrypted. This ensures that all encrypted tag values are different from each other.
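The reason ciphertext multiplication acts as homomorphic addition in Paillier can be worked out in one line. The derivation below is a sketch that assumes the common simplified form of Paillier encryption with generator $n + 1$, where $n$ stands for the public Key1; the symbols $a$, $b$, $r_a$ and $r_b$ are introduced here for illustration only.

$$E(a) \cdot E(b) \equiv (1+n)^{a} r_a^{\,n} \cdot (1+n)^{b} r_b^{\,n} \equiv (1+n)^{a+b} (r_a r_b)^{n} \pmod{n^{2}}$$

The right-hand side has exactly the shape of an encryption of $a + b$ under the encrypted random number $r_a r_b$, so decrypting the product of two ciphertexts yields the sum of the two plaintexts (modulo $n$).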

The homomorphic encryption algorithm in this specification may also be the ElGamal algorithm. When this algorithm homomorphically encrypts an original tag value m, the plaintext m is first divided into groups so that each plaintext group $m_i$ is less than a certain value; then, for each plaintext group $m_i$, an encrypted random number $r_i$ is determined separately, and $c_{i,1} = g^{r_i} \bmod p$ and $c_{i,2} = m_i \cdot y^{r_i} \bmod p$ are computed to obtain the first encrypted tag value $C = \{(c_{i,1}, c_{i,2})\}_{i=1}^{t}$. Here t is the total number of groups, g, y, and p are the public keys of the encryption algorithm, and mod is the remainder function.

The encryption process of the ElGamal algorithm requires two modular exponentiations and one modular multiplication, and the decryption process requires one modular exponentiation, one inversion, and one modular multiplication. Each encryption operation selects a random number, so the ciphertext depends on both the plaintext and the selected random number, and different encryptions of the same plaintext produce different ciphertexts.

In another embodiment of the present specification, when the feature holder 10 obtains the second sequence of N second cryptographic label values arranged in the update order based on the association relationship in step S230, the following steps may be performed:

for any one of the N first encrypted tag values, a preset value 0 is homomorphically encrypted into an encrypted random number E(0) using the public Key1, the encrypted random number E(0) is homomorphically added to the first encrypted tag value E(yy) to obtain the corresponding second encrypted tag value E(0 + yy), and the second sequence of N second encrypted tag values arranged in the update order is determined based on the association relationship. In this embodiment, adding the encrypted random number E(0) to the first encrypted tag value E(yy) means applying the homomorphic addition operation of the encryption algorithm.

The preset value 0 is homomorphically encrypted once for each first encrypted tag value. As described above for the encryption process of the homomorphic encryption algorithm, an encrypted random number r is generated each time encryption is performed, and the value 0 is homomorphically encrypted using the public Key1 and r, so the resulting E(0) differs whenever r differs. That is, the encrypted random numbers E(0) obtained for different first encrypted tag values are also different, so a different E(0) is superimposed on each first encrypted tag value, which perturbs the first encrypted tag values.

After the feature holder 10 performs the above processing on each first encrypted tag value, it avoids sending the first encrypted tag values directly to the tag holder 20, and thus prevents the tag holder 20 from comparing the update order of the first encrypted tag values with the predetermined order in which it generated them and thereby inferring partial feature value information. Even if the tag holder 20 embeds a special mark in a first encrypted tag value, the above processing obscures that mark, so that the tag holder 20 cannot infer any private information of the feature holder from the received information.

After such processing is performed on the first encrypted tag value, the second sequence transmitted by the feature holder 10 in step S240 is composed of N second encrypted tag values arranged in the update order. The tag holder 20 may receive a second sequence of N second encrypted tag values arranged in the update order.

According to the homomorphism of the homomorphic encryption algorithm, performing the homomorphic operation on the homomorphic encryption result $E(0)$ of the value 0 and the first encrypted tag value $E(yy)$ yields a second encrypted tag value that is equivalent to the result $E(0 + yy)$ of directly adding the value 0 to the original tag value yy and then homomorphically encrypting the sum, i.e. $E(0) \oplus E(yy) = E(0 + yy)$, where $\oplus$ denotes the homomorphic addition operation.

Therefore, in step S250, the tag holder 20 may directly perform homomorphic decryption on the N second encrypted tag values in the second sequence using the private Key2 corresponding to the public Key1 to obtain corresponding original tag values, so as to obtain N original tag values arranged in the update order.

Also, if the tag holder 20 does not add any special mark to the first encrypted tag value, it can successfully decrypt the correct original tag value.

In this embodiment, the feature holder superimposes the encrypted random number on the first encrypted tag value, so that even if the tag holder can decrypt the original tag value from the second encrypted tag value, the tag holder cannot acquire any privacy data rule of the feature holder, and a special mark that the tag holder can add is eliminated, thereby improving the data privacy and security of the feature holder.
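The effect of superimposing E(0) can be illustrated with a toy Paillier implementation. This is a sketch only: the primes are deliberately tiny and insecure, the generator is assumed to be Key1 + 1, and all function and variable names are hypothetical rather than taken from the described method.

```python
import math
import random

P, Q = 65537, 2147483647       # toy primes, far too small for real use
N = P * Q                      # plays the role of the public Key1
N2 = N * N
PHI = (P - 1) * (Q - 1)        # plays the role of the private Key2
MU = pow(PHI, -1, N)           # modular inverse (Python 3.8+)


def encrypt(m: int) -> int:
    """Paillier encryption with a fresh encrypted random number r."""
    while True:
        r = random.randrange(2, N)
        if math.gcd(r, N) == 1:
            break
    return (pow(1 + N, m, N2) * pow(r, N, N2)) % N2


def decrypt(c: int) -> int:
    """Recover the plaintext from c using the private values PHI and MU."""
    return ((pow(c, PHI, N2) - 1) // N * MU) % N


def homomorphic_add(c1: int, c2: int) -> int:
    """In Paillier, multiplying ciphertexts adds the plaintexts."""
    return (c1 * c2) % N2


def rerandomize(c: int) -> int:
    """Superimpose a fresh E(0): same plaintext, different ciphertext."""
    return homomorphic_add(c, encrypt(0))


first = encrypt(1)             # first encrypted tag value E(yy) with yy = 1
second = rerandomize(first)    # second encrypted tag value E(0 + yy)
```

In this sketch `second` decrypts to the same tag value as `first`, but the two ciphertexts differ, so matching received ciphertexts against the ones originally sent no longer reveals the permutation.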

In another embodiment of the present specification, when the positions of equal feature values in the update order are not received from the feature holder 10, it is considered that no equal feature values exist among the N feature values. When executing step S260, that is, when performing the adjacent bin merging operation based on at least the N original tag values arranged in the update order to obtain the first binning result, the tag holder 20 may proceed according to the flowchart shown in fig. 4, which includes the following steps S261 to S264.

Step S261, using each position corresponding to the N original tag values arranged in the update order as an initial bin, to obtain N initial bins. This embodiment may correspond to a case where N feature values in the feature holder 10 are different from each other.

Step S262, based on the original tag values in the initial bins, performing adjacent bin merging operation on the initial bins to obtain updated bin results, where the updated bin results show the updated bins corresponding to each position in the update sequence. The updated binning result may also be represented in the manner given in step S230, and the representation thereof is not described herein again.

Step S263, when the updated bins do not meet the preset binning condition, taking the updated bins as initial bins and returning to step S262.

Step S264, when the updated bins meet the preset binning condition, determining the updated binning result as the first binning result.

In this embodiment, the adjacent bin merging operation performed on the initial bins based on the original tag values in each initial bin may use chi-square binning or another merge-based binning method. Chi-square binning is described in detail below.

When adjacent bins are merged by chi-square binning, the chi-square value of each pair of adjacent initial bins can be determined in turn based on the original tag values in each initial bin, yielding a plurality of chi-square values, and the pair of adjacent initial bins with the smallest chi-square value is merged. In the first iteration, the initial bins are the N initial bins corresponding to the N original tag values in the update order. In the second and subsequent iterations, the initial bins are the updated bins from the previous iteration.

For example, for initial bins 1, 2, 3, …, 7 arranged in sequence, the chi-square values of initial bins 1 and 2, initial bins 2 and 3, initial bins 3 and 4, initial bins 4 and 5, initial bins 5 and 6, and initial bins 6 and 7 may be determined in turn to obtain 6 chi-square values, and the adjacent initial bins corresponding to the smallest of the 6 chi-square values are merged. Assuming that the chi-square value of initial bins 1 and 2 is the smallest, initial bins 1 and 2 are merged in this iteration.

In determining the chi-squared value for each pair of adjacent initial bins, the following may be used. Taking an example that the original tag value includes two values (an original tag value 1 and an original tag value 2, for example, 0 and 1, respectively, that is, two classes), for each pair of adjacent initial bins, for example, the initial bin 1 and the initial bin 2, each parameter in table 3 is counted.

TABLE 3

	Original tag value 1	Original tag value 2	Total
Initial bin 1	A11	A12	R1
Initial bin 2	A21	A22	R2
Total	C1	C2

Here A11 denotes the number of samples in initial bin 1 with original tag value 1, A12 the number of samples in initial bin 1 with original tag value 2, A21 the number of samples in initial bin 2 with original tag value 1, and A22 the number of samples in initial bin 2 with original tag value 2. R1 denotes the number of samples in initial bin 1, R1 = A11 + A12; R2 denotes the number of samples in initial bin 2, R2 = A21 + A22. C1 denotes the number of samples with original tag value 1 across the two bins, C1 = A11 + A21; C2 denotes the number of samples with original tag value 2 across the two bins, C2 = A12 + A22. A11, A21, A12, and A22 can be understood as the actual frequencies.

Then, the expected frequencies E11, E12, E21, and E22 are determined from the parameters in Table 3, as shown in Table 4.

TABLE 4

	Original tag value 1	Original tag value 2
Initial bin 1	E11=(R1/N)*C1	E12=(R1/N)*C2
Initial bin 2	E21=(R2/N)*C1	E22=(R2/N)*C2

Here N is the total number of samples in the two bins being compared, that is, N = R1 + R2. Using the data in the two tables above (Tables 3 and 4), the chi-square value of initial bin 1 and initial bin 2 is calculated using the following equation (1):

$\chi^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{(A_{ij} - E_{ij})^2}{E_{ij}}$  (1)

where m is the number of bins involved in the chi-square value, here 2; n is the number of classes of the original tag value, 2 in the above example; and $\chi^2$ is the resulting chi-square value. The calculation of the chi-square value for larger n follows from equation (1) in the same way.
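As a worked illustration of the chi-square calculation for one pair of adjacent bins, take the hypothetical counts $A_{11} = 3$, $A_{12} = 1$, $A_{21} = 1$, $A_{22} = 3$, and assume that $N$ counts the samples in the two bins being compared. Then $R_1 = R_2 = 4$, $C_1 = C_2 = 4$, $N = 8$, and every expected frequency equals $E_{ij} = (4/8) \cdot 4 = 2$, so that

$$\chi^2 = \frac{(3-2)^2}{2} + \frac{(1-2)^2}{2} + \frac{(1-2)^2}{2} + \frac{(3-2)^2}{2} = 2.$$

A pair of bins with identical tag distributions would give $\chi^2 = 0$ and would therefore be merged before this pair.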

In this way, the chi-square value of each pair of adjacent initial bins can be obtained and the binning result updated. It can then be determined whether the updated binning result meets the preset binning condition: if not, step S263 is executed and the iteration continues; if so, step S264 is executed, the iteration ends, and the first binning result is determined.

The preset binning condition may include: the total number of updated bins reaches a preset number. The preset number may be determined from empirical values and may be a specific value greater than 1, for example a value between 5 and 8, or a range of values such as [5, 8]. When the preset number is a range, determining whether the total number of updated bins reaches the preset number means determining whether that total falls within the range.

In step S262, in any iteration, the total number of updated bins may be directly determined, and it is determined whether the total number reaches a preset number.

Alternatively, when the adjacent bin merging operation is performed by chi-square binning, the preset binning condition may include: the chi-square value of every pair of adjacent updated bins is greater than a preset threshold. The preset threshold may be determined empirically.

In step S262, in any iteration after the first, once the updated binning result has been obtained, the chi-square value of each pair of adjacent updated bins may be calculated in the current iteration, and it is determined whether any of these chi-square values is smaller than the preset threshold. If so, the updated bins are taken as the initial bins and the next iteration begins; if not, the updated binning result is determined as the first binning result and the iteration ends.
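The merging loop and its two stopping conditions can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the representation of each bin as a list of per-class label counts, and the choice to skip cells with zero expected count are all assumptions made here.

```python
def chi_square(a, b):
    """Pearson chi-square of two adjacent bins, each given as per-class label counts."""
    total = sum(a) + sum(b)
    value = 0.0
    for row in (a, b):
        for j in range(len(a)):
            expected = sum(row) * (a[j] + b[j]) / total
            if expected > 0:  # assumption: cells with zero expected count are skipped
                value += (row[j] - expected) ** 2 / expected
    return value


def chi_merge(bins, preset_number=None, threshold=None):
    """Repeatedly merge the adjacent pair with the smallest chi-square value.

    Stops when the number of bins has dropped to preset_number, or when every
    adjacent pair has a chi-square value greater than threshold. If neither
    condition is given, merging continues until a single bin remains.
    """
    bins = [list(b) for b in bins]
    while len(bins) > 1:
        if preset_number is not None and len(bins) <= preset_number:
            break
        scores = [chi_square(bins[i], bins[i + 1]) for i in range(len(bins) - 1)]
        if threshold is not None and min(scores) > threshold:
            break
        i = scores.index(min(scores))
        bins[i:i + 2] = [[x + y for x, y in zip(bins[i], bins[i + 1])]]
    return bins


# Hypothetical counts per initial bin: [samples with label 0, samples with label 1].
initial = [[3, 0], [2, 1], [0, 3], [0, 4]]
merged_by_count = chi_merge(initial, preset_number=2)    # [[5, 1], [0, 7]]
merged_by_threshold = chi_merge(initial, threshold=1.0)  # [[3, 0], [2, 1], [0, 7]]
```

In the protocol, only the tag holder would run such a loop, since it alone sees the decrypted label values; the counts above are invented for the example.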

In another embodiment, the above chi-square binning may be modified. For example, based on the original label values in each initial bin, a chi-square value is calculated for each group of three adjacent initial bins in turn, yielding a plurality of chi-square values, and the three adjacent initial bins corresponding to the minimum chi-square value are merged. The three adjacent initial bins may of course be replaced by four adjacent initial bins, five adjacent initial bins, and so on. Different merging granularities give different binning precision and different computational efficiency: the smaller the granularity, the higher the precision of the merged bins, but the larger the amount of calculation and the lower the efficiency.

In this embodiment, the tag holder can perform binning locally based on the second sequence, so that supervised binning based on tags can be smoothly implemented.

In another implementation of this embodiment, the tag holder 20 receives, in addition to the second sequence sent by the feature holder 10, the positions of equal feature values in the update order. When the tag holder 20 executes step S260, that is, when it performs adjacent bin merging operations based at least on the N original tag values arranged in the update order to obtain the first binning result, it may proceed according to the following steps 1a to 4a.

Step 1a: determine the initial bins corresponding to the N original tag values arranged in the update order, based on the positions of equal feature values in the update order. This implementation corresponds to the case where equal feature values exist among the N feature values held by the feature holder 10.

Specifically, in this step, based on the positions of equal feature values in the update order, the N original tag values arranged in the update order are divided so that original tag values at positions of different feature values fall into different initial bins, and original tag values at positions of equal feature values fall into the same initial bin.

This step is explained with reference to the example shown in fig. 3. The 16 positions corresponding to the 16 samples in fig. 3 are numbered from the 1st to the 16th, from left to right. The tag holder 20 has received the positions of the equal feature values in the update order, that is, information indicating that, in fig. 3, the 1st and 2nd positions hold equal feature values, as do the 4th to 7th positions, the 8th and 9th positions, the 10th to 12th positions, the 13th and 14th positions, and the 15th and 16th positions.

Each position in the update order corresponds one-to-one to a position among the N original tag values arranged in the update order. Accordingly, for the N original tag values arranged in the update order, the original tag values at the 1st and 2nd positions may be placed in initial bin 1, the original tag value at the 3rd position in initial bin 2, the original tag values at the 4th to 7th positions in initial bin 3, the original tag values at the 8th and 9th positions in initial bin 4, the original tag values at the 10th to 12th positions in initial bin 5, the original tag values at the 13th and 14th positions in initial bin 6, and the original tag values at the 15th and 16th positions in initial bin 7. Initial bins determined in this way ensure that subsequent binning results never place equal feature values in different bins.

Step 2a: perform an adjacent bin merging operation on the initial bins, based on the original tag values in each initial bin, to obtain an updated binning result, which indicates the updated bin corresponding to each position in the update order.

Step 3a: when the updated bins do not satisfy the preset binning condition, take the updated bins as the initial bins and return to step 2a.

Step 4a: when the updated bins satisfy the preset binning condition, determine the updated binning result as the first binning result.

Apart from step 1a, the steps of this implementation, namely steps 2a to 4a, are the same as in the embodiment shown in fig. 4; for details, refer to the description of steps S262 to S264 of that embodiment, which is not repeated here.
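Step 1a can be illustrated with the 16-position example of fig. 3. The sketch below assumes that the positions of equal feature values arrive as a list of inclusive, 1-based (start, end) ranges; that representation and the function name `initial_bins` are choices made here for illustration.

```python
def initial_bins(n, equal_groups):
    """Assign an initial bin number (starting at 1) to each of the n positions.

    equal_groups lists inclusive 1-based (start, end) ranges of positions that
    hold equal feature values; every other position forms a bin of its own.
    """
    group_end = {start: end for start, end in equal_groups}
    bins = []
    pos, bin_id = 1, 0
    while pos <= n:
        bin_id += 1
        end = group_end.get(pos, pos)  # a position outside any group stands alone
        bins.extend([bin_id] * (end - pos + 1))
        pos = end + 1
    return bins


# Equal-value positions described for fig. 3.
groups = [(1, 2), (4, 7), (8, 9), (10, 12), (13, 14), (15, 16)]
print(initial_bins(16, groups))
# [1, 1, 2, 3, 3, 3, 3, 4, 4, 5, 5, 5, 6, 6, 7, 7]
```

The output reproduces initial bins 1 to 7 as listed in the text, with the 3rd position alone in initial bin 2.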

The embodiments described above provide a merging-based method of feature binning. Based on the same inventive concept, for the application scenario in which the features and labels of the samples are held by different owners, each owner requires privacy protection for its own data, and no party can output its data in plaintext, the embodiments of the present specification further provide another two-party joint binning method, which is based on splitting. In this embodiment, the tag holder 20 homomorphically encrypts the original tag values to obtain first encrypted tag values, generates a range proof for the first encrypted tag values, and sends the first encrypted tag values and the range proof to the feature holder 10. After verifying the first encrypted tag values based on the range proof, the feature holder 10 reorders the feature values, obtains second encrypted tag values arranged in the update order according to the association relationship between the feature values and the received first encrypted tag values, and sends the second encrypted tag values arranged in the update order to the tag holder 20. The tag holder 20 performs a split-based binning operation based on the received information to obtain a first binning result, and sends the first binning result to the feature holder 10. The feature holder 10 bins the feature values arranged in the update order based on the first binning result to obtain a feature binning result. The whole interaction thus involves no transmission of plaintext data, achieves supervised binning of the feature, and uses homomorphic encryption to protect the private data as far as possible. The specific process is described in the embodiment shown in fig. 5.

Fig. 5 illustrates a supervised feature binning method based on privacy protection according to an embodiment, involving a feature holder 10 and a tag holder 20. The feature holder 10 stores feature values of a first feature of N samples, the original tag values of the N samples are stored in the tag holder 20, the N original tag values lie within a specified range, and the N samples are arranged in a predetermined order. The method includes the following steps S510 to S580.

In step S510, the tag holder 20 homomorphically encrypts the N original tag values into corresponding first encrypted tag values using a public key, generates a range proof based on the N first encrypted tag values, and sends the N first encrypted tag values, arranged in the predetermined order, together with the range proof to the feature holder 10. The feature holder 10 receives the N first encrypted tag values arranged in the predetermined order and the corresponding range proof sent by the tag holder 20.

In step S520, the feature holder 10 verifies, based on the range proof, that the original tag values corresponding to the N first encrypted tag values are within the specified range. When the verification passes, it associates the N first encrypted tag values with the N feature values of the first feature, respectively, based on the predetermined order, to obtain an association relationship.

In step S530, the feature holder 10 reorders the N feature values by value to obtain a first sequence of N feature values arranged in an update order and, based on the association relationship, obtains a second sequence of N second encrypted tag values arranged in the update order.

In step S540, the feature holder 10 transmits at least the second sequence to the tag holder 20, and the tag holder 20 receives the second sequence transmitted by the feature holder 10.

In step S550, the tag holder 20 decrypts the N second encrypted tag values in the second sequence into corresponding original tag values by using the private key corresponding to the public key, so as to obtain N original tag values arranged in the update order. When the second encrypted tag value is equal to the corresponding first encrypted tag value, the second encrypted tag value may be directly decrypted by using a private key to obtain a corresponding original tag value.

The specific implementation of the steps S510 to S550 can be the same as that described in the steps S210 to S250, and for the specific description, reference can be made to the steps S210 to S250, which is not described herein again.
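The association and reordering performed by the feature holder in steps S520 and S530 can be sketched as below. The sketch assumes ascending order as the update order and treats the ciphertexts as opaque values (placeholder strings here); the function name `reorder` is hypothetical. It shows the simple case in which each second encrypted tag value equals the associated first encrypted tag value.

```python
def reorder(features, first_encrypted_tags):
    """Sort feature values and carry the associated ciphertexts along.

    Both inputs are in the predetermined sample order, so index i of one list
    is associated with index i of the other.
    """
    if len(features) != len(first_encrypted_tags):
        raise ValueError("one encrypted tag value is required per feature value")
    order = sorted(range(len(features)), key=lambda i: features[i])
    first_sequence = [features[i] for i in order]
    second_sequence = [first_encrypted_tags[i] for i in order]
    return first_sequence, second_sequence


# Samples 1..5 in the predetermined order, with placeholder ciphertexts.
features = [24, 13, 11, 25, 23]
encrypted = ["E(y1)", "E(y2)", "E(y3)", "E(y4)", "E(y5)"]
first_sequence, second_sequence = reorder(features, encrypted)
print(first_sequence)   # [11, 13, 23, 24, 25]
print(second_sequence)  # ['E(y3)', 'E(y2)', 'E(y5)', 'E(y1)', 'E(y4)']
```

Only `second_sequence` would be sent to the tag holder; the feature values in `first_sequence` stay with the feature holder.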

In step S560, the tag holder 20 performs a split-based binning operation based at least on the N original tag values arranged in the update order to obtain a first binning result, which indicates the first bin corresponding to each position in the update order.

In step S570, the tag holder 20 sends the first binning result to the feature holder 10, and the feature holder 10 receives the first binning result, which indicates the first bin corresponding to each position in the update order.

When the tag holder 20 performs the split-based binning operation on the N original tag values arranged in the update order, it may use Best-KS binning or another split-based binning method, for example a binning method based on minimum entropy.

The KS (Kolmogorov-Smirnov) statistic can be used to evaluate a model's ability to differentiate risk; it describes the difference between the cumulative distributions of samples with different labels as the feature data falls into different intervals. Best-KS binning is a top-down (split-based) data discretization method built on this statistic.

The first binning result may be represented in the manner given in step S270, and for a specific implementation, reference may be made to the description in step S270, which is not described herein again.

In step S580, the feature holder 10 performs binning on the feature values at each position in the first sequence according to the first binning result to obtain a feature binning result. For a detailed description of this step, refer to step S280, which is not described herein again.

As can be seen from the above, in this embodiment the tag holder and the feature holder exchange tag values only in homomorphically encrypted form, and processing proceeds only after the verification based on the range proof has passed. The tag holder then performs the split-based binning operation and sends the resulting first binning result to the feature holder. No plaintext data is sent during the whole interaction, supervised binning of the feature is achieved, and homomorphic encryption protects the private data as far as possible.

In another embodiment of the present specification, when the feature holder 10 obtains, in step S530, the second sequence of N second encrypted tag values arranged in the update order based on the association relationship, the following may be performed:

For each of the N first encrypted tag values, a preset value 0 is homomorphically encrypted into an encrypted random number E(0) using the public key Key1, and E(0) is homomorphically added to the first encrypted tag value E(yy) to obtain a second encrypted tag value E(0+yy); the second sequence of N second encrypted tag values arranged in the update order is then determined based on the association relationship. For further details, refer to the preceding embodiments.

After such processing is performed on the first encrypted tag value, the second sequence transmitted by the feature holder 10 in step S540 is composed of N second encrypted tag values arranged in the update order. The tag holder 20 may receive a second sequence of N second encrypted tag values arranged in the update order.

In step S550, the tag holder 20 may directly decrypt the N second encrypted tag values in the second sequence using the private Key2 corresponding to the public Key1 to obtain the corresponding original tag values.
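The effect of adding E(0) can be demonstrated with an additively homomorphic scheme. The text does not name a specific cryptosystem, so the sketch below assumes Paillier encryption with generator n + 1 and uses tiny hard-coded primes; it is insecure and purely illustrative, and all function names are hypothetical.

```python
import math
import random


def keygen(p=1009, q=1013):
    """Toy Paillier key pair from two small primes (insecure, demo only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse (Python 3.8+), valid for g = n + 1
    return n, (lam, mu)   # public key (Key1), private key (Key2)


def encrypt(n, m, rng):
    """E(m) = (n + 1)^m * r^n mod n^2, with a fresh random r coprime to n."""
    n2 = n * n
    r = rng.randrange(2, n)
    while math.gcd(r, n) != 1:
        r = rng.randrange(2, n)
    return pow(n + 1, m, n2) * pow(r, n, n2) % n2


def add(n, c1, c2):
    """Homomorphic addition: multiplying ciphertexts adds the plaintexts."""
    return c1 * c2 % (n * n)


def decrypt(n, private_key, c):
    lam, mu = private_key
    return (pow(c, lam, n * n) - 1) // n * mu % n


rng = random.Random(0)
n, private_key = keygen()
labels = [1, 0, 1, 1]                                    # original tag values
first = [encrypt(n, y, rng) for y in labels]             # first encrypted tag values
second = [add(n, c, encrypt(n, 0, rng)) for c in first]  # E(0 + y)
assert [decrypt(n, private_key, c) for c in second] == labels
assert all(s != f for s, f in zip(second, first))
```

Each second encrypted tag value decrypts to the same tag value as the first but is a different ciphertext, so the tag holder cannot match the ciphertexts it receives against the ones it sent to recover the sample order.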

In another embodiment of the present specification, when the positions of equal feature values in the update order are not received from the feature holder 10, it is considered that no equal feature values exist among the N feature values. In this case, when the tag holder 20 performs the split-based binning operation based at least on the N original tag values arranged in the update order to obtain the first binning result, step S560 may be performed according to the following iterative procedure, comprising steps 1b to 4b.

Step 1b: take the N original tag values arranged in the update order as one initial bin.

Step 2b: for each initial bin, determine a splitting point of the initial bin based on the original tag values in it, and split the initial bin at the splitting point to obtain an updated binning result, which indicates the updated bin corresponding to each position in the update order.

Step 3b: when the updated bins do not satisfy the preset binning condition, take the updated bins as the initial bins and return to step 2b.

Step 4b: when the updated bins satisfy the preset binning condition, determine the updated binning result as the first binning result.

Initially there is one initial bin, which the first iteration splits into 2 updated bins. In the second iteration there are 2 initial bins, which are split into 4 updated bins, and the splitting continues in this way until the updated bins satisfy the preset binning condition. The preset binning condition includes: the total number of updated bins reaches a preset number.
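The iteration of steps 1b to 4b can be sketched independently of how a splitting point is chosen. In the sketch below the splitting rule is passed in as a function; the function names, the half-open index ranges, and the `middle` rule used in the demonstration are all assumptions for illustration (a real run would pass a label-based rule such as Best-KS).

```python
def split_binning(labels, choose_split, preset_number):
    """Top-down binning: return the bin number (from 1) of each position.

    choose_split(segment) returns k with 0 < k < len(segment), meaning a cut
    after the first k values of the segment, or None if it should not be split.
    """
    bins = [(0, len(labels))]            # step 1b: one initial bin, half-open range
    while len(bins) < preset_number:     # steps 3b and 4b: check the preset number
        updated = []
        for start, end in bins:          # step 2b: split every initial bin
            k = choose_split(labels[start:end]) if end - start > 1 else None
            if k is None or not 0 < k < end - start:
                updated.append((start, end))
            else:
                updated += [(start, start + k), (start + k, end)]
        if len(updated) == len(bins):    # nothing could be split any further
            break
        bins = updated
    result = [0] * len(labels)
    for bin_id, (start, end) in enumerate(bins, start=1):
        for pos in range(start, end):
            result[pos] = bin_id
    return result


def middle(segment):
    """Hypothetical placeholder rule: cut each bin in the middle."""
    return len(segment) // 2


print(split_binning([0] * 16, middle, preset_number=4))
# [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4]
```

Because every initial bin is split in each iteration, the number of bins can overshoot a preset number that is not a power of two.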

See fig. 6 for a schematic diagram of the split-based binning process. In the first iteration, all the feature values of the business income form the initial bin; splitting point 1 is determined within it, and the initial bin is split at splitting point 1 to obtain bin 1 and bin 2. In the second iteration, bin 1 and bin 2 are each taken as an initial bin, and splitting point 2 of bin 1 and splitting point 3 of bin 2 are determined. Splitting point 2 is the point between sample 5 and sample 8, and splitting point 3 is the point between sample 15 and sample 12. Bin 1 and bin 2 are split at splitting point 2 and splitting point 3, respectively, to obtain bins 3 to 6. The iteration stops when the total number of bins reaches the preset number.

In this embodiment, when determining the splitting point of the initial binning based on the original tag value in the initial binning in step 2b, Best-KS binning may be performed, or other splitting-based binning methods may also be used. This will be described in detail below using Best-KS binning as an example.

To determine the splitting point of an initial bin, the point between each pair of adjacent original tag values in the initial bin is taken as a candidate splitting point, which divides the initial bin into two sub-bins. Based on the original tag values in the two sub-bins, the KS sum value of the two sub-bins is determined using the Best-KS algorithm and taken as the feature discrimination corresponding to the candidate splitting point. Among the plurality of candidate splitting points, the one with the maximum feature discrimination is selected as the splitting point of the initial bin.

When there are a plurality of initial bins, the above operation is applied to each initial bin to determine its splitting point.

For example, suppose the 5 feature values of a company's purchase amount are all different and, from left to right, are: sample 3 with value 11, sample 2 with value 13, sample 5 with value 23, sample 1 with value 24, and sample 4 with value 25. All 5 feature values of the purchase amount may be taken as the initial bin. In the first iteration, the point between each pair of adjacent original tag values in the initial bin is taken as a candidate splitting point; that is, from left to right, the point between sample 3 and sample 2 is one candidate splitting point, the point between sample 2 and sample 5 is another, and so on, giving 4 candidate splitting points in total. Each candidate splitting point divides the initial bin into two sub-bins. For example, the candidate splitting point between sample 5 and sample 1 divides the initial bin into a left and a right sub-bin, namely sub-bin 1 consisting of the feature values 11, 13 and 23, and sub-bin 2 consisting of the feature values 24 and 25.

To determine the KS sum value of sub-bins 1 and 2 based on the original tag values in them, the parameters n11, n12, n21 and n22 are counted, the KS value of sub-bin 1 is calculated as |n12/n2 - n11/n1|, the KS value of sub-bin 2 is calculated as |n22/n2 - n21/n1|, and the two KS values are added to give the KS sum value, where | | denotes the absolute value.

Here, n11 is the number of samples with tag value 1 in sub-bin 1, n12 is the number of samples with tag value 0 in sub-bin 1, n21 is the number of samples with tag value 1 in sub-bin 2, n22 is the number of samples with tag value 0 in sub-bin 2, n1 is the total number of samples with tag value 1 in the initial bin, and n2 is the total number of samples with tag value 0 in the initial bin. The above example describes the binary case in which the tag values are 0 and 1; implementations for more classes, such as three or four classes, can be derived from this description. The above shows how the feature discrimination of one candidate splitting point is determined; the feature discrimination of the other 3 candidate splitting points can be obtained in the same way. Each candidate splitting point corresponds to one way of splitting, and selecting the candidate splitting point with the maximum feature discrimination selects one way of splitting the initial bin.
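The KS sum value and the choice of splitting point can be sketched for the binary case as follows. The tag values assigned to the five samples are invented for the example (the text gives only their feature values), and treating a bin that contains a single class as having zero discrimination is an assumption made here to avoid division by zero.

```python
def ks_sum(left, right):
    """KS sum value of two sub-bins, following the formulas in the text."""
    n11 = sum(1 for y in left if y == 1)   # tag value 1 in sub-bin 1
    n12 = sum(1 for y in left if y == 0)   # tag value 0 in sub-bin 1
    n21 = sum(1 for y in right if y == 1)  # tag value 1 in sub-bin 2
    n22 = sum(1 for y in right if y == 0)  # tag value 0 in sub-bin 2
    n1, n2 = n11 + n21, n12 + n22          # totals in the initial bin
    if n1 == 0 or n2 == 0:
        return 0.0  # assumption: a single-class bin has no discrimination
    ks1 = abs(n12 / n2 - n11 / n1)
    ks2 = abs(n22 / n2 - n21 / n1)
    return ks1 + ks2


def best_ks_split(labels):
    """Return (k, score): cut after the first k values, with the largest KS sum."""
    best_k, best_score = None, -1.0
    for k in range(1, len(labels)):        # one candidate per adjacent pair
        score = ks_sum(labels[:k], labels[k:])
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score


# Hypothetical tag values for samples 3, 2, 5, 1, 4 (feature values 11, 13, 23, 24, 25).
labels = [0, 0, 1, 0, 1]
k, score = best_ks_split(labels)
print(k, round(score, 4))  # 2 1.3333
```

With these invented tag values, the four candidate splitting points score about 0.667, 1.333, 0.333 and 1.0, so the cut between sample 2 and sample 5 would be selected. In the binary case the two KS values are always equal, so the KS sum value is twice the KS value of either sub-bin.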

The implementation shown in steps 1b to 4b is only one implementation of step S560. In another embodiment, step 3b may be modified; for example, when the updated bins do not satisfy the preset binning condition, some of the updated bins are selected as initial bins and step 2b is executed again.

In another implementation of this embodiment, the tag holder 20 receives, in addition to the second sequence sent by the feature holder 10, the positions of equal feature values in the update order. When the tag holder 20 performs the split-based binning operation based at least on the N original tag values arranged in the update order to obtain the first binning result, step S560 may be performed according to the following iterative procedure, comprising steps 1c to 4c.

Step 1c: take the N original tag values arranged in the update order as one initial bin.

Step 2c: for each initial bin, determine, based on the positions of equal feature values in the update order and the original tag values in the initial bin, a splitting point that does not lie between equal feature values, and split the initial bin at the splitting point to obtain an updated binning result, which indicates the updated bin corresponding to each position in the update order.

Step 3c: when the updated bins do not satisfy the preset binning condition, take the updated bins as the initial bins and return to step 2c.

Step 4c: when the updated bins satisfy the preset binning condition, determine the updated binning result as the first binning result. The preset binning condition includes: the total number of updated bins reaches a preset number.

In this embodiment, when determining the splitting point in the initial bin, which is not located at the position between the equal feature values, based on the position of the equal feature value in the update sequence and the original tag value in the initial bin in step 2c, Best-KS binning may be performed, or other splitting-based binning methods may be used. This will be described in detail below using Best-KS binning as an example.

For any initial bin, based on the positions of equal feature values in the update order, the point between each pair of adjacent original tag values, excluding points that lie between equal feature values, is taken as a candidate splitting point, which divides the initial bin into two sub-bins. Based on the original tag values in the two sub-bins, the KS sum value of the two sub-bins is determined using the Best-KS algorithm and taken as the feature discrimination of the corresponding candidate splitting point. Among the plurality of candidate splitting points, the one with the maximum feature discrimination is selected as the splitting point of the initial bin.

For example, taking the business income shown in fig. 6, there are 16 positions from left to right, one for each of the 16 samples, and several groups of equal feature values can be seen. Initially, the 16 positions as a whole are taken as the initial bin. In the first iteration, the point between sample 7 and sample 5, the point between sample 5 and sample 8, the point between sample 9 and sample 3, the point between sample 11 and sample 10, the point between sample 15 and sample 12, and the point between sample 13 and sample 16 are taken as the candidate splitting points. Each candidate splitting point divides the initial bin into two sub-bins. To determine the KS sum value of the sub-bins 1 and 2 corresponding to any candidate splitting point, based on the original tag values in them, the parameters n11, n12, n21 and n22 are counted, the KS value of sub-bin 1 is calculated as |n12/n2 - n11/n1|, the KS value of sub-bin 2 is calculated as |n22/n2 - n21/n1|, and the two KS values are added to give the KS sum value, where | | denotes the absolute value.

Here, n11 is the number of samples with tag value 1 in sub-bin 1, n12 is the number of samples with tag value 0 in sub-bin 1, n21 is the number of samples with tag value 1 in sub-bin 2, n22 is the number of samples with tag value 0 in sub-bin 2, n1 is the total number of samples with tag value 1 in the initial bin, and n2 is the total number of samples with tag value 0 in the initial bin. The above example describes the case of two tag classes; implementations for more classes, such as three or four classes, can be derived from this description. The above shows how the feature discrimination of one candidate splitting point is determined; the feature discrimination of the other 5 candidate splitting points can be obtained in the same way. Each candidate splitting point corresponds to one way of splitting, and selecting the candidate splitting point with the maximum feature discrimination selects one way of splitting the initial bin.
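The restriction on candidate splitting points in step 2c can be sketched as follows. The sketch assumes that the positions of equal feature values are given as inclusive, 1-based (start, end) ranges and uses the ranges listed for the fig. 3 example; the function name `candidate_cuts` is hypothetical.

```python
def candidate_cuts(start, end, equal_groups):
    """Candidate splitting points of the initial bin covering positions start..end.

    A returned value k denotes the point between position k and position k + 1.
    Points that would separate equal feature values are excluded.
    """
    blocked = set()
    for group_start, group_end in equal_groups:
        # Cuts after group_start .. group_end - 1 fall inside the group.
        blocked.update(range(group_start, group_end))
    return [k for k in range(start, end) if k not in blocked]


groups = [(1, 2), (4, 7), (8, 9), (10, 12), (13, 14), (15, 16)]
print(candidate_cuts(1, 16, groups))  # [2, 3, 7, 9, 12, 14]
```

For the 16 positions this leaves 6 candidate splitting points, the same number as in the example above. Each remaining candidate can then be scored with the KS sum value exactly as in the case without equal feature values.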

The implementation shown in steps 1c to 4c is only one implementation of step S560. In another embodiment, step 3c may be modified; for example, when the updated bins do not satisfy the preset binning condition, some of the updated bins are selected as initial bins and step 2c is executed again.

For the related descriptions in the embodiments of fig. 5 and fig. 6, reference may be made to the corresponding descriptions in the embodiment of fig. 2; the embodiments may be read in light of one another.

In the above embodiments, each step may include a plurality of sub-operation steps, and in each embodiment of this specification, in a case where there is no explicit description and there is no logical precedence relationship, the order of execution between the plurality of sub-operation steps in each step is variable.

The foregoing describes certain embodiments of the present specification, and other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily have to be in the particular order shown or in sequential order to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

Fig. 7 is a schematic block diagram of a joint feature binning apparatus based on privacy protection according to an embodiment. The apparatus 700 is deployed in a feature holder, which may be any computer, cluster, or device with computing and processing capabilities. The feature holder stores feature values of a first feature of N samples, the original tag values of the N samples are stored in a tag holder, the N original tag values lie within a specified range, and the N samples are arranged in a predetermined order. This apparatus embodiment corresponds to the method embodiment shown in fig. 2. The apparatus 700 comprises:

an obtaining module 710 configured to obtain N first encrypted tag values arranged in the predetermined order and a corresponding range proof sent by the tag holder, wherein each first encrypted tag value is obtained by homomorphically encrypting the corresponding original tag value using a public key;

a verification module 720 configured to verify, based on the range proof, that the original tag values corresponding to the N first encrypted tag values are within the specified range;

the association module 730 is configured to, when the verification module 720 passes the verification, associate the N first encrypted tag values with the N feature values of the first feature, respectively, based on the predetermined order, to obtain an association relationship;

a reordering module 740 configured to reorder the N feature values by value to obtain a first sequence of N feature values arranged in an update order, and to obtain, based on the association relationship, a second sequence of N second encrypted tag values arranged in the update order;

a first sending module 750 configured to send at least the second sequence to the tag holder, so that the tag holder performs adjacent binning and merging operations based on at least the second sequence to obtain a first binning result;

a first receiving module 760 configured to receive the first binning result sent by the tag holder, wherein the first binning result indicates the first bin corresponding to each position in the update order;

a first binning module 770 configured to bin the feature values at the positions in the first sequence according to the first binning result to obtain a feature binning result.

In one embodiment, the verification module 720 is specifically configured to:

perform calculation processing on the range proof and verify, according to the processing result, whether the original tag values corresponding to the first encrypted tag values are all within the specified range; if so, the verification passes.

In a specific embodiment, when obtaining, based on the association relationship, the second sequence of N second encrypted tag values arranged in the update order, the reordering module 740 is configured to:

for each of the N first encrypted tag values, homomorphically encrypt a preset value 0 into an encrypted random number using the public key, and homomorphically add the encrypted random number to the first encrypted tag value to obtain the corresponding second encrypted tag value;

and determine, based on the association relationship, the second sequence of N second encrypted tag values arranged in the update order.

In one embodiment, the first sending module 750 is configured to send the second sequence directly to the tag holder when no equal feature values exist among the N feature values.

In a specific embodiment, the first sending module 750 is configured to, when equal feature values exist among the N feature values, determine the positions of the equal feature values in the update order based on the N feature values in the first sequence, and send the second sequence and the positions of the equal feature values in the update order to the tag holder.

In a specific embodiment, the first binning module 770 is specifically configured to:

match each position in the first binning result with the corresponding position in the first sequence, and determine the first bin of each position in the first binning result as the bin of the feature value at the corresponding position in the first sequence.
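This position-by-position mapping can be sketched as follows, assuming the first binning result is a list holding one bin number per position of the update order; that representation and the function name `apply_binning` are assumptions made for illustration.

```python
from collections import defaultdict


def apply_binning(first_sequence, first_binning_result):
    """Group the sorted feature values by the bin number given for each position."""
    if len(first_sequence) != len(first_binning_result):
        raise ValueError("the binning result must cover every position")
    bins = defaultdict(list)
    for value, bin_id in zip(first_sequence, first_binning_result):
        bins[bin_id].append(value)
    return dict(bins)


# Sorted feature values held by the feature holder, and a hypothetical result
# received from the tag holder.
print(apply_binning([11, 13, 23, 24, 25], [1, 1, 2, 2, 2]))
# {1: [11, 13], 2: [23, 24, 25]}
```

The feature holder learns which of its feature values share a bin, and from those values it can derive the bin boundaries, without seeing any tag value.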

In one embodiment, the positions of the equal feature values in the update order are represented in one of the following ways:

preset separators are placed between positions in the update order to mark the positions that hold equal feature values;

or each position in the update order is represented by one entry of a one-dimensional bitmap, and the positions holding equal feature values are distinguished by a specified rule for assigning values in the bitmap.
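One possible bitmap rule is sketched below. The source leaves the value-assignment rule unspecified, so the rule used here (a bit is 1 when a position holds the same feature value as the position before it) and the names `encode_ties` and `decode_ties` are assumptions made for illustration.

```python
def encode_ties(first_sequence):
    """Bitmap over the update order: 1 means 'same value as previous position'."""
    return [
        1 if i > 0 and first_sequence[i] == first_sequence[i - 1] else 0
        for i in range(len(first_sequence))
    ]


def decode_ties(bitmap):
    """Recover the groups of positions that hold equal feature values."""
    groups = []
    for pos, bit in enumerate(bitmap):
        if bit and groups:
            groups[-1].append(pos)
        else:
            groups.append([pos])
    return groups


# Hypothetical sorted feature values (the first sequence).
bitmap = encode_ties([10, 10, 20, 30, 30, 30])
groups = decode_ties(bitmap)
```

Under this rule the bitmap tells the tag holder only which neighbouring positions are tied, not the feature values themselves.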

Fig. 8 is a schematic block diagram of a joint feature binning apparatus based on privacy protection according to an embodiment. The apparatus 800 is deployed in a tag holder, which may be any computer, cluster, or device with computing and processing capabilities. The tag holder stores the original tag values of N samples, the feature values of a first feature of the N samples are stored in the feature holder, the N original tag values lie within a specified range, and the N samples are arranged in a set order. This apparatus embodiment corresponds to the method embodiment shown in Fig. 2. The apparatus 800 comprises:

an encryption module 810 configured to homomorphically encrypt N original tag values into corresponding first encrypted tag values using a public key, generate a range certificate based on the N first encrypted tag values, and send the N first encrypted tag values arranged in the predetermined order and the range certificate to the feature holder;

a second receiving module 820 configured to receive at least a second sequence sent by the feature holder after the verification performed by the feature holder based on the range certificate has passed, where the second sequence is composed of N second encrypted tag values arranged in an update order;

a decryption module 830, configured to decrypt, using a private key corresponding to the public key, the N second encrypted tag values in the second sequence into corresponding original tag values, to obtain N original tag values arranged according to an update sequence;

a second binning module 840 configured to perform adjacent binning merging operations based on at least the N original tag values arranged in the update order to obtain a first binning result, where the first binning result shows first binning corresponding to each position in the update order;

a second sending module 850 configured to send the first binned result to the feature holder.

In one embodiment, when generating the range certificate based on the N first encrypted tag values, the encryption module 810 generates the range certificate by using the Bulletproofs algorithm.

In one embodiment, the second binning module 840 is specifically configured to:

taking each position corresponding to the N original tag values arranged in the update order as one initial bin, to obtain N initial bins;

performing the adjacent-bin merging operation on the initial bins based on the original tag values in each initial bin to obtain an updated binning result, where the updated binning result shows the updated bin corresponding to each position in the update order;

when the updated bins do not meet the preset binning condition, taking the updated bins as the initial bins and returning to the step of performing the adjacent-bin merging operation on the initial bins based on the original tag values in each initial bin;

and when the updated bins meet the preset binning condition, determining the updated binning result as the first binning result.

In one embodiment, in addition to the second sequence sent by the feature holder, the positions of equal feature values in the update order sent by the feature holder are also received; the second binning module 840 is specifically configured to:

determining, based on the positions of the equal feature values in the update order, the initial bins corresponding to the N original tag values arranged in the update order;

when the updated bins do not meet the preset binning condition, taking the updated bins as the initial bins and returning to the step of performing the adjacent-bin merging operation on the initial bins based on the original tag values in each initial bin;

In a specific embodiment, when determining, based on the positions of the equal feature values in the update order, the initial bins corresponding to the N original tag values arranged in the update order, the second binning module 840 is configured to perform:

for the N original tag values arranged in the update order, based on the positions of the equal feature values in the update order, placing the original tag values at positions with different feature values into different initial bins, and placing the original tag values at positions with the same feature value into the same initial bin.
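The grouping rule above can be sketched as follows. The encoding of the tie positions is an assumption: a bitmap in which 1 means that a position holds the same feature value as the position before it. The function name `initial_bins` and the sample data are hypothetical.

```python
def initial_bins(tags_in_update_order, tie_bitmap):
    """Group decrypted tag values so that tied positions share one initial bin."""
    bins = []
    for tag, same_as_prev in zip(tags_in_update_order, tie_bitmap):
        if same_as_prev and bins:
            bins[-1].append(tag)   # same feature value: same initial bin
        else:
            bins.append([tag])     # new feature value: new initial bin
    return bins


# Hypothetical data: positions 0-1 are tied, and positions 3-4 are tied.
bins = initial_bins([1, 0, 1, 1, 0], [0, 1, 0, 0, 1])
```

Starting from these grouped bins means that no later merge can place samples with the same feature value into different bins.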

In one embodiment, when performing the adjacent-bin merging operation on the initial bins based on the original tag values in each initial bin, the second binning module 840 is configured to perform:

determining, in turn, the chi-square value of each pair of adjacent initial bins based on the original tag values in each initial bin to obtain a plurality of chi-square values, and merging the pair of adjacent initial bins with the smallest chi-square value.

In one embodiment, the preset binning condition includes: the total number of updated bins reaches a preset number; or, when adjacent bins are merged by chi-square binning, the chi-square value of each pair of adjacent updated bins is greater than a preset threshold.
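The merging loop and its two stopping conditions can be sketched as follows. This is a minimal illustration for binary tag values (0 or 1) using the usual chi-square statistic on the 2x2 table of two adjacent bins; skipping cells whose expected count is zero is an assumption, as are the names `chi_square` and `chi_merge` and the parameters `max_bins` and `threshold`.

```python
def chi_square(a, b):
    """Chi-square of two adjacent bins; a and b are [count_tag0, count_tag1]."""
    total = sum(a) + sum(b)
    chi = 0.0
    for row in (a, b):
        for j in range(2):
            expected = sum(row) * (a[j] + b[j]) / total
            if expected > 0:  # assumption: cells with zero expectation add 0
                chi += (row[j] - expected) ** 2 / expected
    return chi


def chi_merge(tags, max_bins, threshold=None):
    """Merge adjacent bins until a stopping condition holds; return bin id per position."""
    bins = [[i] for i in range(len(tags))]  # one initial bin per position

    def counts(positions):
        ones = sum(tags[p] for p in positions)
        return [len(positions) - ones, ones]

    while len(bins) > max(max_bins, 1):
        chis = [chi_square(counts(bins[k]), counts(bins[k + 1]))
                for k in range(len(bins) - 1)]
        k = min(range(len(chis)), key=chis.__getitem__)
        if threshold is not None and chis[k] > threshold:
            break  # every adjacent pair already exceeds the threshold
        bins[k:k + 2] = [bins[k] + bins[k + 1]]  # merge the most similar pair

    result = [0] * len(tags)
    for bin_id, positions in enumerate(bins):
        for p in positions:
            result[p] = bin_id
    return result


# Hypothetical decrypted tag values in the update order.
first_binning_result = chi_merge([0, 0, 0, 1, 1, 1], max_bins=2)
```

The returned list plays the role of the first binning result: it gives a bin for each position in the update order without revealing any tag value to the feature holder beyond what the bin boundaries imply.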

Fig. 9 is a schematic block diagram of another joint feature binning apparatus based on privacy protection according to an embodiment. The apparatus 900 is deployed in a tag holder, which may be any computer, cluster, or device with computing and processing capabilities. The tag holder stores the original tag values of N samples, the feature values of a first feature of the N samples are stored in the feature holder, the N original tag values lie within a specified range, and the N samples are arranged in a set order. This apparatus embodiment corresponds to the method embodiment shown in Fig. 5. The apparatus 900 comprises:

an encryption module 910 configured to homomorphically encrypt N original tag values into corresponding first encrypted tag values using a public key, generate a range certificate based on the N first encrypted tag values, and send the N first encrypted tag values arranged in the predetermined order and the range certificate to the feature holder;

a second receiving module 920, configured to receive at least the second sequence sent by the feature holder after the feature holder passes the verification based on the range certification, where the second sequence is composed of N second encrypted tag values arranged in an update order;

a decryption module 930 configured to decrypt, using a private key corresponding to the public key, the N second encrypted tag values in the second sequence into corresponding original tag values, to obtain N original tag values arranged according to the update sequence;

a third binning module 940, configured to perform splitting binning operation at least based on the N original tag values arranged according to the update sequence to obtain a first binning result, where the first binning result shows first binning corresponding to each position in the update sequence;

a second sending module 950 configured to send the first binned result to the feature holder.

In one embodiment, when generating the range certificate based on the N first encrypted tag values, the encryption module 910 generates the range certificate by using the Bulletproofs algorithm.

In a specific embodiment, the third binning module 940 is specifically configured to:

taking the N original tag values arranged in the update order as one initial bin;

for any one initial bin, determining a split point of the initial bin based on the original tag values in the initial bin, and splitting the initial bin at the split point to obtain an updated binning result, where the updated binning result shows the updated bin corresponding to each position in the update order;

when the updated bins do not meet the preset binning condition, taking the updated bins as the initial bins and returning to the step of determining, for any one initial bin, a split point of the initial bin based on the original tag values in the initial bin;

In a specific embodiment, when determining, for any one initial bin, the split point of the initial bin based on the original tag values in the initial bin, the third binning module 940 is configured to perform:

for any one initial bin, taking a point between each pair of adjacent original tag values in the initial bin as a candidate split point that divides the initial bin into two sub-bins, and determining, by the Best-KS algorithm and based on the original tag values in the two sub-bins, the KS sum value of the two sub-bins as the feature discrimination of that candidate split point; and selecting, from the candidate split points, the one with the largest feature discrimination as the split point of the initial bin.
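The split-point search above can be sketched as follows. The source does not spell out how the "KS sum value" is computed, so this sketch assumes the common Best-KS formulation for binary tag values: the score of a candidate split is the absolute difference between the cumulative share of tag-1 samples and the cumulative share of tag-0 samples to the left of the split. The function name `best_ks_split` is hypothetical.

```python
def best_ks_split(tags):
    """Return (split, ks); positions [0, split) go left and [split, n) go right."""
    total_pos = sum(tags)
    total_neg = len(tags) - total_pos
    if total_pos == 0 or total_neg == 0:
        return None, 0.0  # a pure bin has no informative split
    best_split, best_ks = None, -1.0
    cum_pos = cum_neg = 0
    for k in range(1, len(tags)):  # candidate split between k-1 and k
        cum_pos += tags[k - 1]
        cum_neg += 1 - tags[k - 1]
        ks = abs(cum_pos / total_pos - cum_neg / total_neg)
        if ks > best_ks:  # keep the first candidate with the largest score
            best_split, best_ks = k, ks
    return best_split, best_ks


# Hypothetical decrypted tag values in the update order.
split, ks = best_ks_split([0, 0, 0, 1, 0, 1, 1, 1])
```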

In one embodiment, in addition to the second sequence sent by the feature holder, the positions of equal feature values in the update order sent by the feature holder are also received; the third binning module 940 is specifically configured to:

for any one initial bin, determining, based on the positions of the equal feature values in the update order and the original tag values in the initial bin, a split point in the initial bin that does not fall between equal feature values, and splitting the initial bin at the split point to obtain an updated binning result, where the updated binning result shows the updated bin corresponding to each position in the update order;

In a specific embodiment, when determining, for any one initial bin and based on the positions of the equal feature values in the update order and the original tag values in the initial bin, a split point in the initial bin that does not fall between equal feature values, the third binning module 940 is configured to perform:

for any one initial bin, based on the positions of the equal feature values in the update order, taking a point between each pair of adjacent original tag values in the initial bin, other than points between positions holding equal feature values, as a candidate split point that divides the initial bin into two sub-bins, and determining, by the Best-KS algorithm and based on the original tag values in the two sub-bins, the KS sum value of the two sub-bins as the feature discrimination of that candidate split point; and selecting, from the candidate split points, the one with the largest feature discrimination as the split point of the initial bin.
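The constrained splitting loop can be sketched as follows. Several details are assumptions rather than statements from the source: the tie positions are encoded as a bitmap in which 1 means "same feature value as the previous position", the KS score is the absolute difference of cumulative tag-1 and tag-0 shares within the bin, and only the single best split across all bins is applied per round so that the preset number of bins is not overshot. The names `ks_at` and `split_binning` are hypothetical.

```python
def ks_at(tags, lo, hi, k):
    """KS score for splitting the bin covering positions [lo, hi) at k."""
    segment = tags[lo:hi]
    pos = sum(segment)
    neg = len(segment) - pos
    if pos == 0 or neg == 0:
        return 0.0
    left = tags[lo:k]
    left_pos = sum(left)
    return abs(left_pos / pos - (len(left) - left_pos) / neg)


def split_binning(tags, tie_bitmap, max_bins):
    """Split until max_bins is reached; return a bin id for every position."""
    bounds = [0, len(tags)]  # bin i covers positions [bounds[i], bounds[i + 1])
    while len(bounds) - 1 < max_bins:
        candidates = [
            (ks_at(tags, lo, hi, k), k)
            for lo, hi in zip(bounds, bounds[1:])
            for k in range(lo + 1, hi)
            if not tie_bitmap[k]  # never separate equal feature values
        ]
        if not candidates:
            break
        _, best_k = max(candidates, key=lambda c: c[0])
        bounds = sorted(bounds + [best_k])
    result = []
    for bin_id, (lo, hi) in enumerate(zip(bounds, bounds[1:])):
        result += [bin_id] * (hi - lo)
    return result


tags = [0, 0, 0, 1, 0, 1, 1, 1]           # hypothetical decrypted tag values
ties = [0, 0, 0, 1, 0, 0, 0, 0]           # position 3 ties with position 2
constrained = split_binning(tags, ties, max_bins=2)
unconstrained = split_binning(tags, [0] * 8, max_bins=2)
```

In this example the unconstrained search splits between positions 2 and 3, while the constrained search must skip that point because the two positions hold equal feature values, and it falls back to the next-best split.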

In one embodiment, the pre-set binning conditions comprise: the total number of the plurality of update bins reaches a preset number.

The above apparatus embodiments correspond to the method embodiments and are obtained from them, so they have the same technical effects as the corresponding method embodiments. For details, reference may be made to the description of the method embodiments, which is not repeated here.

Embodiments of the present specification also provide a computer-readable storage medium having a computer program stored thereon, which, when executed in a computer, causes the computer to perform the method described with reference to any one of Figs. 1 to 6.

The present specification also provides a computing device, including a memory and a processor, where the memory stores executable code, and the processor, when executing the executable code, implements the method described with reference to any one of Figs. 1 to 6.

The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the storage medium and the computing device embodiments, since they are substantially similar to the method embodiments, they are described relatively simply, and reference may be made to some descriptions of the method embodiments for relevant points.

Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in connection with the embodiments of the invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.

The above-mentioned embodiments further describe the objects, technical solutions and advantages of the embodiments of the present invention in detail. It should be understood that the above description is only exemplary of the embodiments of the present invention, and is not intended to limit the scope of the present invention, and any modification, equivalent replacement, or improvement made on the basis of the technical solutions of the present invention should be included in the scope of the present invention.

Claims

1. A joint feature binning method based on privacy protection is executed by a feature holder, wherein the feature holder stores feature values of first features of N samples, original tag values of the N samples are stored in the tag holder, the values of the N original tag values are in a specified range, and the N samples are arranged according to a set sequence; the method comprises the following steps:

2. The method of claim 1, the step of verifying, based on the range attestation, that values of original tag values corresponding to the N first encrypted tag values are within the specified range, comprising:

3. The method according to claim 1, wherein the step of processing the second sequence of N second encrypted tag values arranged in the update order based on the association relationship comprises:

4. The method of claim 1, wherein the step of sending at least the second sequence to the tag holder, in a case where no equal feature values exist among the N feature values, comprises: directly sending the second sequence to the tag holder.

5. The method of claim 1, wherein the step of sending at least the second sequence to the tag holder, in a case where equal feature values exist among the N feature values, comprises:

determining the positions of the equal feature values in the update order based on the N feature values in the first sequence, and sending the second sequence and the positions of the equal feature values in the update order to the tag holder.

6. The method of claim 1, wherein the step of binning the feature values for each position in the first sequence according to the first binning result comprises:

7. The method of claim 1, wherein the positions of the equal feature values in the update order are represented in one of the following ways:

8. A joint feature binning method based on privacy protection, executed by a tag holder, wherein the tag holder stores original tag values of N samples, the feature values of a first feature of the N samples are stored in the feature holder, the N original tag values lie within a specified range, and the N samples are arranged in a set order; the method comprises the following steps:

sending the first binned result to the feature holder.

9. The method of claim 8, the step of generating a range certificate based on the N first encrypted tag values comprising:

generating the range certificate by using the Bulletproofs algorithm based on the N first encrypted tag values.

10. The method of claim 8, wherein the step of performing an adjacent-bin merging operation based on at least the N original tag values arranged in the update order to obtain a first binning result comprises:

11. The method of claim 8, wherein the positions of equal feature values in the update sequence transmitted by the feature holder are received in addition to the second sequence transmitted by the feature holder; the step of performing adjacent binning merge operation based on at least the N original tag values arranged in the update order includes:

12. The method of claim 11, wherein the step of determining the initial bins corresponding to the N original tag values arranged in the update order based on the positions of the equal feature values in the update order comprises:

13. The method of claim 10 or 11, wherein the step of performing a neighbor bin merge operation on each initial bin based on the original label value in each initial bin comprises:

14. The method of claim 10 or 11, the preset binning conditions comprising: the total number of the plurality of updating sub-boxes reaches a preset number; or when the adjacent sub-boxes are combined in a chi-square sub-box mode, the chi-square value of any pair of updated sub-boxes in the plurality of updated sub-boxes is larger than the preset threshold value.

15. A joint feature binning method based on privacy protection, executed by a tag holder, wherein the tag holder stores original tag values of N samples, the feature values of a first feature of the N samples are stored in the feature holder, the N original tag values lie within a specified range, and the N samples are arranged in a set order; the method comprises the following steps:

receiving at least a second sequence sent by the feature holder after the verification performed by the feature holder based on the range certificate has passed, wherein the second sequence is composed of N second encrypted tag values arranged in an update order;

sending the first binned result to the feature holder.

16. The method of claim 15, the step of generating a range attestation based on the N first cryptographic tag values comprising:

17. The method of claim 15, wherein the step of performing a split binning operation based on at least the N original tag values arranged in the updated order to obtain a first binning result comprises:

when each updated sub-box does not meet the preset sub-box condition, taking the updated sub-box as an initial sub-box, returning to execute the step of determining a splitting point of the initial sub-box based on an original label value in the initial sub-box aiming at any one initial sub-box;

18. The method of claim 17, wherein the step of determining, for any initial bin, a split point for the initial bin based on the original tag value in the initial bin comprises:

19. The method of claim 15, wherein the positions of equal feature values in the update sequence transmitted by the feature holder are received in addition to the second sequence transmitted by the feature holder; the step of performing splitting and binning operation at least based on the N original tag values arranged in the update sequence to obtain a first binning result includes:

20. The method of claim 19, wherein for any initial bin, determining a split point in the initial bin that is not located at a position between equal eigenvalues based on the position of the equal eigenvalue in the update order and the original label value in the initial bin comprises:

21. The method of claim 17 or 19, the preset binning conditions comprising: the total number of the plurality of update bins reaches a preset number.

22. A joint feature binning device based on privacy protection is deployed in a feature holder, wherein the feature holder stores feature values of first features of N samples, original tag values of the N samples are stored in the tag holder, the values of the N original tag values are within a specified range, and the N samples are arranged according to a set sequence; the device comprises:

23. A joint feature binning device based on privacy protection, deployed in a tag holder, wherein the tag holder stores original tag values of N samples, the feature values of a first feature of the N samples are stored in the feature holder, the N original tag values lie within a specified range, and the N samples are arranged in a set order; the device comprises:

24. A joint feature binning device based on privacy protection, deployed in a tag holder, wherein the tag holder stores original tag values of N samples, the feature values of a first feature of the N samples are stored in the feature holder, the N original tag values lie within a specified range, and the N samples are arranged in a set order; the device comprises:

a second receiving module configured to receive at least a second sequence sent by the feature holder after the verification performed by the feature holder based on the range certificate has passed, the second sequence being composed of N second encrypted tag values arranged in an update order;

25. A computer-readable storage medium, having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of any one of claims 1-21.

26. A computing device comprising a memory having executable code stored therein and a processor that, when executing the executable code, implements the method of any of claims 1-21.