CN113688354B

CN113688354B - Chi-square binning method based on secure multi-party computation

Info

Publication number: CN113688354B
Application number: CN202110999974.5A
Authority: CN
Inventors: 何道敬; 孙黎彤; 杜润萌; 张民; 张熙; 廖清
Original assignee: East China Normal University
Current assignee: East China Normal University
Priority date: 2021-08-27
Filing date: 2021-08-27
Publication date: 2023-06-09
Anticipated expiration: 2041-08-27
Also published as: CN113688354A

Abstract

The invention discloses a chi-square binning method based on secure multi-party computation, which provides a new way of computing chi-square values for feature engineering in federated learning. The data provider does not need to encrypt all of its feature data and send it to the data application party for feature preprocessing. Instead, it first groups the feature data by category, mixes in fake groups, tags each group with its class (real or fake), encrypts the class tags and sends the result to the data application party. Encrypting only the group class tags greatly reduces the amount of data to be encrypted, and the data application party does not need to decrypt all of the feature data, which avoids a heavy resource cost. What the data provider sends to the data application party is only the grouping information of the feature data: it does not contain the actual content of the feature data, fake grouping information is added to it, and the real and fake groups are distinguished by encoded tags.

Description

Chi-square binning method based on secure multi-party computation

Technical Field

The invention belongs to the field of federated learning, and in particular relates to a chi-square binning method based on secure multi-party computation.

Background

Federated learning does not model raw data directly; a dataset must be constructed before training begins. The task of converting raw data into a dataset is known as feature engineering.

Feature selection is an important step in feature engineering. When a classification model is built, continuous variables generally need to be discretized; a model built on discretized features is more stable, which reduces the risk of overfitting. Binning, the operation that discretizes continuous feature data, is therefore often performed during feature selection. Binning has many benefits: it is more robust to outliers, so abnormal data interferes less with modeling; after discretization each bin of a feature carries its own weight, which introduces nonlinearity into a logistic regression model and improves its expressive power; missing values can be treated as a separate category and included in the model; and inner products over the sparse vectors produced by discretization are fast to compute, with results that are easy to store and extend. For accurate discretization, the data is first partitioned into intervals by category. If two adjacent intervals have very similar class distributions they can be merged; otherwise they should remain separate. A low chi-square value indicates that two adjacent intervals have similar class distributions. The chi-square value is therefore computed after the feature data is binned: the smaller the value, the more similar the distributions, and the two intervals can be merged into one bin.

During feature discretization and the evaluation of feature predictive power in federated learning preprocessing, the party that lacks label data must send its own feature data to the party that holds the labels so that the two can preprocess the features jointly.

Among existing federated learning frameworks, some have the data provider encrypt the entire feature matrix with a public key to satisfy privacy requirements and send the ciphertext matrix to the data application party, which decrypts it with the private key on receipt. On large data sets this approach clearly incurs significant resource consumption and performance loss. Others transmit desensitized data directly for computation, which does not protect data privacy and does not meet legal and regulatory requirements. In still others, each participant trains independently and only the training results are fused, so the value of the data cannot be fully exploited.

Disclosure of Invention

The object of the invention is to provide a novel chi-square binning method based on secure multi-party computation. To discretize data accurately, the data is first partitioned into intervals by category. If two adjacent intervals have very similar class distributions they can be merged; otherwise they should remain separate. A low chi-square value indicates that two adjacent intervals have similar class distributions. The chi-square value is computed after the feature data is binned: the smaller the value, the more similar the distributions, and the two intervals can be merged into one bin.

The specific technical scheme for realizing the aim of the invention is as follows:

A chi-square binning method based on secure multi-party computation, comprising the following steps:

Step 1: The data provider generates a public key pk and a private key sk using a homomorphic encryption scheme. For feature data X = {x_0, x_1, ..., x_{n-1}} with sample ids id ∈ [0, n-1], the ids of samples belonging to the same category in X are grouped into one interval, giving s groups denoted x_t, t ∈ [0, s-1], where n and s are positive integers. The class of each real group x_t is marked as 1 and encrypted with the public key pk, denoted E_x = E(1), which yields the real grouping information Group_t(x_t, E_x);

Step 2: Fake groups are constructed. The ids of the feature data X are randomly divided into s group intervals, the same number as the real groups, denoted x_v, v ∈ [0, s-1]. The class of each fake group is marked as 0 and encrypted with the public key pk, denoted E_x = E(0), which yields the fake grouping information Group_v(x_v, E_x);

Step 3: The real and fake grouping information are concatenated by rows and the rows are shuffled to obtain the grouping information Group_X(x_i, E_x), which the data provider sends to the data application party;

Step 4: The data application party maps the ids in Group_X(x_i, E_x) to the ids of the label data Y = {y_0, y_1, ..., y_i, ..., y_{n-1}}, id ∈ [0, n-1], to obtain the label data y_i corresponding to each group interval x_i. Summing the label data y_i of each group interval x_i gives the number of responding samples in the interval, Group_y. From the total number of samples in the interval, Group_s, the number of non-responding samples is computed as Group_n = Group_s - Group_y. The data application party then sends Group_y, Group_n, Group_s and the corresponding encrypted class tag E_x of every group interval to the data provider;

Step 5: The data provider decrypts each class tag E_x with the private key sk to obtain the decrypted class tag D_x, where D_x = 1 indicates a real group and D_x = 0 indicates a fake group, and deletes the fake grouping information;

Step 6: From the number of responding samples Group_y, the number of non-responding samples Group_n and the total number of samples Group_s of the real group intervals, the data provider computes the expected number of samples E_ij of class j in group i, i ∈ [0, 2s-1], where j ∈ [0, 2) indexes the two classes, responding and non-responding samples. From the expected sample numbers E_ij and the actual sample numbers A_ij of two adjacent real groups, it computes the chi-square value χ² of the two adjacent real groups;

Step 7: The data provider sets a limit on the number of bins. Based on the chi-square values of adjacent groups, the two groups with the smallest chi-square value are merged, and the chi-square values of adjacent groups are recomputed after each merge. Merging stops when the number of bins reaches the limit, which gives the final binning result.

In step 1, the real group x_t contains only the ids of the feature data, id ∈ [0, n-1], and not the actual feature values, which prevents leakage of the actual feature values.

In step 2, the ids of the feature data X are randomly divided into s group intervals in order to construct fake groups, which are mixed with the real groups to protect the real grouping information.

In step 3, the grouping information Group_X(x_i, E_x) mixes fake grouping information with real grouping information, and the classes of both fake and real groups are encrypted, which protects the privacy of the feature data.

In step 4, the number of responding samples Group_y is obtained as follows: the grouping information x_i contains ids of the feature data, which are matched against the ids of the label data Y to obtain the label values corresponding to x_i. For example, if the i-th group is x_i = [0, 2], the corresponding label values are [y_0, y_2]. Because a responding sample has label value 1 and a non-responding sample has label value 0, summing the label values corresponding to the group gives the number of responding samples in the group, Group_y.

In step 4, the number of non-responding samples Group_n is obtained as follows: the number of samples in each group is the number of ids in x_i, i.e. the length of x_i, which gives the group's sample count Group_s; subtracting the number of responding samples from Group_s gives the number of non-responding samples Group_n.

In step 6, the expected number of samples E_ij of class j in group i is calculated as:

$$E_{ij} = \frac{R_i \times C_j}{N}$$

where R_i is the sum of the sample counts of the two classes in group i, i.e. R_i = Group_s^(i); C_j is the number of samples of class j across the two adjacent groups, so that when j denotes the responding class C_j = Group_y^(i) + Group_y^(i+1) (and analogously C_j = Group_n^(i) + Group_n^(i+1) for the non-responding class); and N is the total number of samples in the two adjacent groups, i.e. N = Group_s^(i) + Group_s^(i+1).

In step 6, the chi-square value χ² is calculated as:

$$\chi^2 = \sum_{i}\sum_{j} \frac{(A_{ij} - E_{ij})^2}{E_{ij}}$$

where the sums run over the two adjacent groups and the two classes, A_ij is the actual number of samples of class j in group i (if j denotes the responding class, A_ij = Group_y^(i)), and E_ij is the expected number of samples of class j in group i.
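The two formulas of step 6 can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function name `chi_square_pair` is hypothetical, each group is passed as a `(Group_y, Group_n)` pair, and terms whose expected count is zero are skipped, which is an assumption the text does not state.

```python
from fractions import Fraction

def chi_square_pair(group_a, group_b):
    """Chi-square value of two adjacent groups.

    Each group is a (responding, non_responding) pair of counts,
    i.e. (Group_y, Group_n). Returns an exact Fraction.
    """
    rows = [group_a, group_b]
    row_totals = [sum(r) for r in rows]                       # R_i = Group_s^(i)
    col_totals = [rows[0][j] + rows[1][j] for j in range(2)]  # C_j
    n = sum(row_totals)                                       # N (must be > 0)
    chi2 = Fraction(0)
    for i in range(2):
        for j in range(2):
            expected = Fraction(row_totals[i] * col_totals[j], n)  # E_ij
            if expected == 0:
                # Assumption: a class absent from both groups contributes nothing.
                continue
            chi2 += (rows[i][j] - expected) ** 2 / expected
    return chi2

# Two groups with the same class distribution give a chi-square value of 0.
print(chi_square_pair((1, 1), (2, 2)))  # 0
```

Exact fractions are used so that small counts do not pick up floating-point noise; ordinary floats would serve equally well on large data.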

The beneficial effects of the invention are that

In terms of security, the invention protects data privacy during chi-square binning in the feature engineering stage of federated learning. The feature data is grouped, the data index ids of samples in the same category form the real grouping information, and fake grouping information is added. Real groups are marked with class 1 and fake groups with class 0, and these 0/1 class codes are encrypted. The real grouping information is mixed with the fake grouping information and then sent to the data application party. The data application party does not learn the specific feature values in a group, only the ids of the corresponding samples, and because fake groups are mixed in, the privacy of the feature data is protected.

In terms of efficiency, the invention does not need to encrypt all feature values before sending them to the data application party; it encrypts only the group class of the feature data. This avoids the computational cost of encrypting and decrypting large volumes of data, and the efficiency gain is especially pronounced on large data sets.

Drawings

FIG. 1 is a flow chart of the present invention.

Detailed Description

The invention is described in further detail below with reference to a specific example and the drawing. Except where specifically mentioned below, the procedures, conditions and experimental methods used to carry out the invention are common general knowledge in the art, and the invention is not particularly limited in these respects.

Examples

Let the data provider's feature data be X = {0,2,2,4,5,6,6,6} and the data application party's label data be Y = {0,1,1,1,0,0,1,1}. Taking the computation of the chi-square binning result for the data provider's feature data X as an example, the steps of the chi-square binning method based on secure multi-party computation are as follows:

First, the data provider groups the ids of samples in feature data X that belong to the same category into one interval. The grouping result is x_t = [0], [1,2], [3], [4], [5,6,7], five groups in total. These are marked as real groups and the group class is encrypted with the public key pk, E_x = E(1), which yields the real grouping information Group_t(x_t, E_x), with the following contents:

x_t	E_x
[0]	E(1)
[1,2]	E(1)
[3]	E(1)
[4]	E(1)
[5,6,7]	E(1)
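The real grouping above can be reproduced with a short Python sketch. It assumes that "same category" means an identical feature value and that groups are ordered by that value so that adjacency is meaningful; the function name `real_groups` is hypothetical.

```python
from collections import defaultdict

def real_groups(features):
    """Group sample ids by identical feature value, ordered by value."""
    ids_by_value = defaultdict(list)
    for sample_id, value in enumerate(features):
        ids_by_value[value].append(sample_id)
    return [ids_by_value[v] for v in sorted(ids_by_value)]

X = [0, 2, 2, 4, 5, 6, 6, 6]
print(real_groups(X))  # [[0], [1, 2], [3], [4], [5, 6, 7]]
```

Only the ids leave this function, never the feature values, which is the property the method relies on.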

Second, fake groups are constructed. The ids of the feature data X are randomly divided into s intervals, and the grouping result is x_v = [0,1,2], [3,4], [5], [6], [7], five groups in total, the same number as the real groups. These groups are marked as fake groups and the group class is encrypted with the public key pk, E_x = E(0), which yields the fake grouping information Group_v(x_v, E_x), with the following contents:

x_v	E_x
[0,1,2]	E(0)
[3,4]	E(0)
[5]	E(0)
[6]	E(0)
[7]	E(0)
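One way to draw such a fake grouping is sketched below. The text only says the ids are divided randomly into s intervals; splitting the id range at s-1 random cut points into non-empty contiguous intervals, as in the example above, is an assumption, and `fake_groups` is a hypothetical name.

```python
import random

def fake_groups(n, s, rng=random):
    """Split ids 0..n-1 into s non-empty contiguous intervals at random."""
    if not 1 <= s <= n:
        raise ValueError("need 1 <= s <= n")
    cuts = sorted(rng.sample(range(1, n), s - 1))  # s-1 distinct cut points
    bounds = [0, *cuts, n]
    return [list(range(lo, hi)) for lo, hi in zip(bounds, bounds[1:])]

# Five fake groups over eight ids; the result differs from run to run.
print(fake_groups(8, 5))
```

Passing a seeded `random.Random` instance as `rng` makes the partition reproducible for testing.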

Then the real grouping information Group_t(x_t, E_x) and the fake grouping information Group_v(x_v, E_x) are concatenated by rows and the rows are shuffled to obtain the grouping information Group_X(x_i, E_x), which is sent to the data application party. The contents of Group_X(x_i, E_x) are as follows:

x_i	E_x
[0,1,2]	E(0)
[3,4]	E(0)
[0]	E(1)
[5]	E(0)
[1,2]	E(1)
[3]	E(1)
[6]	E(0)
[7]	E(0)
[4]	E(1)
[5,6,7]	E(1)
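The tagging, shuffling and later filtering of groups can be sketched as follows. `ToyTagCipher` is a hypothetical stand-in, not a homomorphic encryption scheme and not secure: it hands out random tokens and keeps a private lookup table that plays the role of the private key sk. `build_group_info` and `keep_real` are likewise illustrative names.

```python
import random
import secrets

class ToyTagCipher:
    """Stand-in for the homomorphic scheme: NOT real encryption.

    encrypt() returns a fresh random token; the provider keeps a private
    table (playing the role of sk) that maps tokens back to 0 or 1.
    """
    def __init__(self):
        self._table = {}

    def encrypt(self, bit):
        token = secrets.token_hex(8)
        self._table[token] = bit
        return token

    def decrypt(self, token):
        return self._table[token]

def build_group_info(real, fake, cipher, rng=random):
    """Tag real groups with E(1) and fake groups with E(0), then shuffle rows."""
    rows = [(g, cipher.encrypt(1)) for g in real]
    rows += [(g, cipher.encrypt(0)) for g in fake]
    rng.shuffle(rows)
    return rows

def keep_real(rows, cipher):
    """Provider side: drop every row whose tag decrypts to 0."""
    return [row for row in rows if cipher.decrypt(row[1]) == 1]

real = [[0], [1, 2], [3], [4], [5, 6, 7]]
fake = [[0, 1, 2], [3, 4], [5], [6], [7]]
cipher = ToyTagCipher()
group_x = build_group_info(real, fake, cipher)
print(len(group_x), len(keep_real(group_x, cipher)))  # 10 5
```

Because every token is fresh, the receiving party cannot tell the tags of real and fake rows apart by comparing them; a real deployment would need a probabilistic public-key scheme to get the same property.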

Next, the data application party maps the ids in Group_X to the ids of the label data Y = {0,1,1,1,0,0,1,1} to obtain the label values corresponding to each group interval. Summing the label data y_i corresponding to each group interval x_i gives the number of responding samples in the interval, Group_y, and from the total number of samples in the interval, Group_s, the number of non-responding samples in the interval is computed as

Group_n = Group_s - Group_y

Then the data application party sends the number of responding samples Group_y, the number of non-responding samples Group_n, the total number of samples Group_s and the corresponding encrypted class tag E_x of every group interval to the data provider;
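The counting done by the data application party can be sketched as below, using the label data Y of this example and the ten shuffled group intervals. The function name `group_counts` and the variable `shuffled_groups` are hypothetical.

```python
def group_counts(ids, labels):
    """Return (Group_y, Group_n, Group_s) for one group of sample ids."""
    group_y = sum(labels[i] for i in ids)  # labels are 0/1, so the sum counts responses
    group_s = len(ids)                     # number of ids in the group
    return group_y, group_s - group_y, group_s

Y = [0, 1, 1, 1, 0, 0, 1, 1]
shuffled_groups = [[0, 1, 2], [3, 4], [0], [5], [1, 2],
                   [3], [6], [7], [4], [5, 6, 7]]
for ids in shuffled_groups:
    print(ids, group_counts(ids, Y))
```

For the group [0] this gives (0, 1, 1) and for [1, 2] it gives (2, 0, 2). The data application party computes the same statistics for fake groups, since it cannot tell them apart from real ones.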

The data provider decrypts the class tags E_x with the private key sk to identify the real grouping information: a group whose decrypted class tag is 1 is a real group. From the number of responding samples Group_y, the number of non-responding samples Group_n and the total number of samples Group_s of each real group interval, it computes the expected number of samples E_ij of class j in group i, where j ∈ [0, 2) indexes the responding and non-responding classes. Taking the two adjacent real group intervals [0] and [1,2] as an example, the chi-square value of the two groups is computed from the following information:

Group number	Group	Group_y	Group_n	R_i (Group_s)
0	[0]	0	1	1
1	[1,2]	2	0	2
-	C_j	2	1	3

For group interval [0], the number of responding samples is Group_y^(0) = 0, the total number of samples is Group_s^(0) = 1 and the number of non-responding samples is Group_n^(0) = 1. With E_ij = R_i × C_j / N and N = 3, the expected sample numbers of this group are 1 × 2 / 3 = 2/3 for the responding class and 1 × 1 / 3 = 1/3 for the non-responding class.

For group [1,2], the expected sample numbers are 2 × 2 / 3 = 4/3 for the responding class and 2 × 1 / 3 = 2/3 for the non-responding class.

Finally, from the expected sample numbers E_ij and the actual sample numbers A_ij of the two adjacent real groups, the chi-square value χ² of the two adjacent real groups is computed;
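As a worked check, the chi-square value of the groups [0] and [1,2] can be written out from the table above. This assumes the usual Pearson form, a sum of (A_ij - E_ij)² / E_ij over both groups and both classes, with expected counts R_i × C_j / N equal to 2/3, 1/3, 4/3 and 2/3; the ordering of terms (responding class first) is a choice made here for illustration.

```latex
\chi^2
= \frac{\left(0-\tfrac{2}{3}\right)^2}{\tfrac{2}{3}}
+ \frac{\left(1-\tfrac{1}{3}\right)^2}{\tfrac{1}{3}}
+ \frac{\left(2-\tfrac{4}{3}\right)^2}{\tfrac{4}{3}}
+ \frac{\left(0-\tfrac{2}{3}\right)^2}{\tfrac{2}{3}}
= \tfrac{2}{3} + \tfrac{4}{3} + \tfrac{1}{3} + \tfrac{2}{3}
= 3
```

The two groups have opposite class distributions (all non-responding versus all responding), so the value is comparatively large and this pair would not be the first to merge.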

The data provider sets a limit on the number of bins. Based on the chi-square values of adjacent groups, the two groups with the smallest chi-square value χ² are merged, and the chi-square values of adjacent groups are recomputed after the merge. Merging stops once the number of bins reaches the limit, which gives the chi-square binning result.
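The merging loop can be sketched as follows on the five real groups of this example. Several things here are assumptions rather than part of the text: the bin limit of 3, the leftmost-pair tie-break, the names `chi_square_2x2` and `chi_merge`, and the use of the closed form N(ad - bc)² / (R_0 R_1 C_0 C_1), which is algebraically equal to the Pearson sum for a 2×2 table and is taken as 0 when a row or column total is zero.

```python
from fractions import Fraction

def chi_square_2x2(a, b):
    """Chi-square of two adjacent bins given as (responding, non_responding)."""
    y0, n0 = a
    y1, n1 = b
    denom = (y0 + n0) * (y1 + n1) * (y0 + y1) * (n0 + n1)
    if denom == 0:
        return Fraction(0)  # assumption: an empty class or bin contributes nothing
    total = y0 + n0 + y1 + n1
    return Fraction(total * (y0 * n1 - n0 * y1) ** 2, denom)

def chi_merge(bins, max_bins):
    """Merge the adjacent pair with the smallest chi-square until max_bins remain.

    bins: list of (ids, responding, non_responding), ordered by feature value.
    """
    bins = list(bins)
    while len(bins) > max_bins:
        scores = [chi_square_2x2(bins[k][1:], bins[k + 1][1:])
                  for k in range(len(bins) - 1)]
        k = scores.index(min(scores))  # ties go to the leftmost pair (assumption)
        left, right = bins[k], bins[k + 1]
        bins[k:k + 2] = [(left[0] + right[0],
                          left[1] + right[1],
                          left[2] + right[2])]
    return bins

real_bins = [([0], 0, 1), ([1, 2], 2, 0), ([3], 1, 0),
             ([4], 0, 1), ([5, 6, 7], 2, 1)]
for ids, group_y, group_n in chi_merge(real_bins, max_bins=3):
    print(ids, group_y, group_n)
```

With a limit of three bins, the groups [1,2] and [3] merge first (chi-square 0), then [4] and [5,6,7] (chi-square 4/3), leaving the bins [0], [1,2,3] and [4,5,6,7].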

Claims

1. A chi-square binning method based on secure multi-party computation, characterized by comprising the following steps:

Step 1: The data provider generates a public key pk and a private key sk using a homomorphic encryption scheme. For feature data X = {x_0, x_1, ..., x_{n-1}} with sample ids id ∈ [0, n-1], the ids of samples belonging to the same category in X are grouped into one interval, giving s groups denoted x_t, t ∈ [0, s-1], where n and s are positive integers. The class of each real group x_t is marked as 1 and encrypted with the public key pk, denoted E_x = E(1), which yields the real grouping information Group_t(x_t, E_x);

Step 6: From the number of responding samples Group_y, the number of non-responding samples Group_n and the total number of samples Group_s of the real group intervals, the data provider computes the expected number of samples E_ij of class j in group i, i ∈ [0, 2s-1], where j ∈ [0, 2) indexes the two classes, responding and non-responding samples. From the expected sample numbers E_ij and the actual sample numbers A_ij of two adjacent real groups, it computes the chi-square value χ² of the two adjacent real groups;

2. The chi-square binning method based on secure multi-party computation according to claim 1, wherein the real group x_t of step 1 contains only the ids of the feature data, id ∈ [0, n-1], and not the actual feature values, which prevents leakage of the actual feature values.

3. The chi-square binning method based on secure multi-party computation according to claim 1, wherein in step 2 the ids of the feature data X are randomly divided into s group intervals in order to construct fake groups, which are mixed with the real groups to protect the real grouping information.

4. The chi-square binning method based on secure multi-party computation according to claim 1, wherein the grouping information Group_X(x_i, E_x) of step 3 mixes fake grouping information with real grouping information, and the classes of both fake and real groups are encrypted, which protects the privacy of the feature data.

5. The chi-square binning method based on secure multi-party computation according to claim 1, wherein the number of responding samples Group_y of step 4 is obtained as follows: the grouping information x_i contains ids of the feature data, which are matched against the ids of the label data Y to obtain the label values corresponding to x_i; if the i-th group is x_i = [0, 2], the corresponding label values are [y_0, y_2]; because a responding sample has label value 1 and a non-responding sample has label value 0, summing the label values corresponding to the group gives the number of responding samples in the group, Group_y.

6. The chi-square binning method based on secure multi-party computation according to claim 1, wherein the number of non-responding samples Group_n of step 4 is obtained as follows: the number of samples in each group is the number of ids in x_i, i.e. the length of x_i, which gives the group's sample count Group_s; subtracting the number of responding samples from Group_s gives the number of non-responding samples Group_n.

7. The chi-square binning method based on secure multi-party computation according to claim 1, wherein the expected number of samples E_ij of class j in group i of step 6 is calculated as:

$$E_{ij} = \frac{R_i \times C_j}{N}$$

where R_i is the sum of the sample counts of the two classes in group i, i.e. R_i = Group_s^(i); C_j is the number of samples of class j across the two adjacent groups, so that when j denotes the responding class C_j = Group_y^(i) + Group_y^(i+1); and N is the total number of samples in the two adjacent groups, i.e. N = Group_s^(i) + Group_s^(i+1).

8. The chi-square binning method based on secure multi-party computation according to claim 1, wherein the chi-square value χ² of step 6 is calculated as: