CN109492428A

CN109492428A - A differential privacy protection method for principal component analysis

Info

Publication number: CN109492428A
Application number: CN201811265579.9A
Authority: CN
Inventors: 杨庚; 徐亚红; 汪伟亚; 蒋辰
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2018-10-29
Filing date: 2018-10-29
Publication date: 2019-03-19

Abstract

The invention discloses a differential privacy protection method for principal component analysis, comprising the following steps: centering the data matrix, i.e., subtracting from each dimension of the data the mean of that dimension; computing the covariance matrix A of the data matrix; computing the eigenvalues λ and eigenvectors V of the covariance matrix A; computing the number k of principal components to retain; mapping the original data onto the principal component space to obtain the projection matrix Z; allocating a privacy budget ε_j to each column of the projection matrix Z and computing the random noise to be added; adding the noise to the projection matrix Z to obtain the noisy projection matrix Z′; and computing the error between the original data and the low-rank approximate data. The invention reduces the dimensionality of the data set effectively and thereby simplifies the data, and it avoids adding noise to "unimportant" data, which reduces the waste of privacy budget. This improves data utility, so that the published data reflects the true data as closely as possible while the privacy of the data is protected.

Description

A differential privacy protection method for principal component analysis

Technical field

The present invention relates to a differential privacy protection method for principal component analysis and belongs to the field of information security technology.

Background art

With the continuous development of big data technology, the data stored by various information systems has become increasingly rich, which increases the complexity of data analysis and processing. As one of the important methods of data analysis, principal component analysis converts many variables into a few principal variables that represent most of the information in the original data and reveal its essential structure. Principal component analysis thus simplifies the data, making it easier to use while reducing the computational cost of algorithms. Data sets generally contain a great deal of private information, and analyzing the data directly with machine learning or data mining algorithms leads to privacy leakage. Differential privacy is currently a popular privacy protection technique. It is realized through a noise mechanism: random noise is added to the output to protect the data. The larger the added noise, the safer the data but the lower its utility, and vice versa.

For multi-attribute data, the traditional Laplace mechanism allocates a privacy budget of the same size to every attribute. This scheme is simple and easy to apply, but the added noise is too large and data utility drops sharply. Privacy budget is also allocated to "unimportant" data, so part of the budget is wasted and the result is unsatisfactory.

Summary of the invention

In view of the defects described in the background art, the problem to be solved by the invention is to provide a differential privacy protection method for principal component analysis. The invention reduces the dimensionality of the data set effectively and thereby simplifies the data, and it avoids adding noise to "unimportant" data, which reduces the waste of privacy budget. This improves data utility, so that the published data reflects the true data as closely as possible while the privacy of the data is protected.

To solve the above problem, the following technical solution is adopted:

The differential privacy protection method for principal component analysis of the invention is based on a preset sample data set X, a number of samples n, and a sample space dimension d. The principal component analysis method comprises the following steps:

Step 1: Center the data matrix, i.e., subtract from each dimension of the data the mean of that dimension;

Step 2: Compute the covariance matrix $A = \frac{1}{n} X^{T} X$ from the data matrix obtained in step 1, where X^T is the transpose of the data matrix X;

Step 3: Compute the eigenvalues λ and eigenvectors V of the covariance matrix A from step 2, which satisfy AV = λV; sort the eigenvalues in descending order, λ₁ > λ₂ > … > λ_d, with corresponding eigenvectors v₁, v₂, …, v_d;

Step 4: Compute the number k of principal components to retain;

Step 5: Map the original data onto the principal component space to obtain the projection matrix Z;

Step 6: Allocate a privacy budget ε_j to each column of the projection matrix Z and compute the random noise to be added;

Step 7: Add the noise to the projection matrix Z to obtain the noisy projection matrix Z′;

Step 8: Compute the error between the original data and the low-rank approximate data.
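Steps 2 and 3 can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the implementation used by the inventors: the function name `covariance_eigen`, the toy data, and the 1/n scaling of the covariance matrix are assumptions made for illustration. `numpy.linalg.eigh` is used because the covariance matrix is symmetric.

```python
import numpy as np

def covariance_eigen(X_centered):
    """Return eigenvalues in descending order and the matching eigenvectors (as columns)."""
    n = X_centered.shape[0]
    A = X_centered.T @ X_centered / n      # d x d covariance matrix (1/n scaling is assumed)
    eigvals, eigvecs = np.linalg.eigh(A)   # eigh returns ascending eigenvalues for symmetric A
    order = np.argsort(eigvals)[::-1]      # reorder so that lambda_1 >= lambda_2 >= ...
    return eigvals[order], eigvecs[:, order]

# Toy data (hypothetical): 6 samples, 3 attributes, centered column by column.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
Xc = X - X.mean(axis=0)
lam, V = covariance_eigen(Xc)
print(lam)   # eigenvalues, largest first
```

Each column of `V` pairs with the eigenvalue at the same position in `lam`, so the first k columns give the matrix V_k used in step 5.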

In step 1, to simplify the computation of the covariance matrix, each attribute has its mean removed so that every dimension has mean 0 after centering, as shown in formula (1):

$$x'_j = x_j - \bar{x}_j, \qquad \bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij} \tag{1}$$

where x_j is the data of the j-th attribute of all samples, x′_j is the data of the j-th attribute of all samples after centering, x_ij is the data of the j-th attribute of the i-th sample in data set X, and $\bar{x}_j$ is the mean of the j-th attribute.
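The centering step can be illustrated with a small NumPy sketch. The function name `center_columns` and the 4 × 2 data set are hypothetical; rows are taken to be samples and columns to be attributes, matching the n × d layout implied by Z = XV_k.

```python
import numpy as np

def center_columns(X):
    """Subtract each attribute's mean from its column; also return the means."""
    means = X.mean(axis=0)          # mean of attribute j over all n samples
    return X - means, means

# Hypothetical 4 x 2 data set: rows are samples, columns are attributes.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [6.0, 40.0]])
X_centered, means = center_columns(X)
print(means)                        # attribute means: 3.0 and 25.0
print(X_centered.mean(axis=0))      # each column mean is now 0, up to rounding
```

The means are returned as well because step 8 adds them back when the low-rank approximation is rebuilt.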

In step 4, an eigenvalue contribution threshold α is set, where 0 ≤ α ≤ 1, and the number k of principal components to retain is computed so that the eigenvalue contribution ratio per of the retained principal components satisfies per ≥ α, where:

$$per = \frac{\sum_{i=1}^{k}\lambda_i}{\sum_{i=1}^{d}\lambda_i} \tag{2}$$
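One way to read this rule is to take the smallest k whose cumulative eigenvalue share reaches α. The sketch below assumes that reading; the function name `choose_k` and the example spectrum are hypothetical and are not taken from the patent's data set.

```python
from itertools import accumulate

def choose_k(eigenvalues, alpha):
    """Smallest k whose leading eigenvalues reach a contribution ratio of at least alpha."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    for k, partial in enumerate(accumulate(ordered), start=1):
        if partial / total >= alpha:
            return k
    return len(ordered)

# Hypothetical spectrum: cumulative shares are 0.50, 0.80, 0.95, 0.99, 1.00.
eigenvalues = [50.0, 30.0, 15.0, 4.0, 1.0]
print(choose_k(eigenvalues, 0.9))   # 3: two components give 0.80 < 0.9, three give 0.95
```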

In step 5, the projection matrix Z = XV_k is the mapping of the original data onto the principal component space, where V_k = (v₁, v₂, …, v_k) consists of the eigenvectors corresponding to the k retained principal components.

In step 6, the random noise is Laplace noise, i.e., the noise follows the Laplace distribution Lap(b), where b is the scale parameter, b = Δf/ε, Δf is the global sensitivity, and ε is the privacy budget;

The probability density function of the Laplace distribution with scale parameter b is:

$$p(x) = \frac{1}{2b}\exp\left(-\frac{|x|}{b}\right)$$

where x ranges over all possible values and p(x) is the probability density at x.
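The relation between budget, sensitivity and noise scale can be made concrete with a standard-library sketch. The function names and the numeric values of Δf and ε are hypothetical. The sampler relies on the fact that the difference of two independent exponential variates with mean b follows Lap(b).

```python
import math
import random

def laplace_pdf(x, b):
    """Density of the zero-mean Laplace distribution with scale b."""
    return math.exp(-abs(x) / b) / (2.0 * b)

def laplace_sample(b, rng=random):
    """Draw one Lap(b) sample as the difference of two exponential variates with mean b."""
    return rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)

sensitivity = 1.0                  # hypothetical global sensitivity (Delta f)
epsilon = 0.5                      # hypothetical privacy budget
b = sensitivity / epsilon          # scale parameter b = Delta f / epsilon
print(laplace_pdf(0.0, b))         # peak density 1 / (2b) = 0.25
print(laplace_sample(b))           # one random noise value
```

Halving ε doubles b, which flattens the density and makes large noise values more likely. This is the trade-off between privacy and utility described in the background.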

The j-th column of the projection matrix Z = XV_k represents the mapping of the original data onto the j-th principal component. Since each column has a different meaning, the columns may be allocated equal or unequal privacy budgets ε_j, where 1 ≤ j ≤ k.

Equal privacy budgets ε_j (equal split): $\varepsilon_j = \frac{\varepsilon}{k}$, i.e., every column is allocated the same privacy budget;

Unequal privacy budgets ε_j (allocation by weight): $\varepsilon_j = \frac{\lambda_j}{\sum_{i=1}^{k}\lambda_i}\,\varepsilon$, i.e., the privacy budget is allocated in proportion to each principal component's share of the eigenvalues.
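The two allocation rules can be compared side by side in a few lines. This is a sketch under assumptions: the function names and the retained eigenvalues are hypothetical, and the weighted rule normalises over the k retained eigenvalues so that the shares add up to ε.

```python
def equal_budgets(epsilon, k):
    """Equal split: every retained component receives epsilon / k."""
    return [epsilon / k] * k

def weighted_budgets(epsilon, eigenvalues):
    """Weighted split: component j receives epsilon * lambda_j / (sum of retained lambdas)."""
    total = sum(eigenvalues)
    return [epsilon * lam / total for lam in eigenvalues]

# Hypothetical retained eigenvalues (k = 3) and total budget epsilon = 1.
retained = [6.0, 3.0, 1.0]
print(equal_budgets(1.0, 3))            # three equal shares of one third each
print(weighted_budgets(1.0, retained))  # shares of about 0.6, 0.3 and 0.1
```

Under either rule the per-column budgets sum to ε. Under the weighted rule, a component with a small eigenvalue receives a small ε_j and therefore a large noise scale Δf/ε_j.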

In step 7, the noisy projection matrix is Z′ = (z′₁, z′₂, …, z′_j, …, z′_k), where z′_j is given by:

$$z'_j = z_j + Lap\left(\frac{\Delta f}{\varepsilon_j}\right)$$

where z_j is the j-th column of the projection matrix and Δf is the global sensitivity of the projection matrix.
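A column-wise noise step along these lines could look as follows in NumPy. The function name `add_laplace_noise`, the all-zero projection matrix, the budgets, and the sensitivity value of 1.0 are placeholders chosen for illustration, since this text does not give the sensitivity expression.

```python
import numpy as np

def add_laplace_noise(Z, budgets, sensitivity, rng):
    """Add Lap(sensitivity / eps_j) noise independently to every entry of column j."""
    scales = sensitivity / np.asarray(budgets, dtype=float)   # one scale b_j per column
    noise = rng.laplace(loc=0.0, scale=scales, size=Z.shape)  # scales broadcast across rows
    return Z + noise

rng = np.random.default_rng(42)            # seeded only to make the sketch repeatable
Z = np.zeros((1000, 2))                    # hypothetical projection matrix with k = 2
Z_noisy = add_laplace_noise(Z, budgets=[0.8, 0.2], sensitivity=1.0, rng=rng)
print(np.abs(Z_noisy).mean(axis=0))        # the column with the smaller budget is noisier
```

Because Z is zero here, the printed values estimate the mean absolute noise per column, which for Lap(b) equals b (1.25 and 5 for these budgets).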

In step 8, the low-rank approximate matrix is $Y = Z' V_k^{T} + \bar{x}$, where $V_k^{T}$ is the transpose of the eigenvector matrix V_k and $\bar{x}$ is the mean of the attributes, with $\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$.

The error of the approximate data is computed using formula (5);

MSE-F = ||Y − X||_F (5)

where ||·||_F is the Frobenius (F) norm of a matrix, i.e., the square root of the sum of the squares of the matrix elements;

Let C be an m × n matrix; then the F norm of C is:

$$\|C\|_F = \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n} |c_{ij}|^{2}}$$
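The reconstruction and the error measure can be checked with a short NumPy sketch. It assumes that the low-rank approximation is obtained by mapping the noisy projection back with V_k^T and adding the attribute means, and that X in formula (5) is the uncentered data. The function name `low_rank_error` and the random test data are hypothetical.

```python
import numpy as np

def low_rank_error(X, Z_noisy, V_k, means):
    """Rebuild Y = Z' V_k^T + means and return the Frobenius norm of Y - X."""
    Y = Z_noisy @ V_k.T + means
    return np.linalg.norm(Y - X, ord="fro")

# Hypothetical check: with all d components kept and no noise, the error vanishes.
rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))
means = X.mean(axis=0)
Xc = X - means
_, V = np.linalg.eigh(Xc.T @ Xc / len(X))   # orthonormal eigenvectors as columns
Z = Xc @ V                                  # projection with k = d
print(low_rank_error(X, Z, V, means))       # close to 0, up to rounding
```

With k < d or with noise added to Z, the same function returns a positive value that combines the truncation error and the noise error.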

Compared with the prior art, the invention, by adopting the above technical solution, has the following technical effects:

To address the defect that the traditional Laplace mechanism adds too much noise, the invention proposes a better way of adding noise, so that the low-rank approximate data obtained by reconstruction is distorted to a certain degree. This achieves privacy protection while preserving data utility. The method is simple, easy to operate, and places no restriction on the size or attributes of the data set. Its features are as follows:

(1) To ensure the security of the principal component analysis algorithm, a principal component analysis algorithm with differential privacy protection is designed by adding appropriate noise to the projection matrix, and the algorithm is proved to satisfy the differential privacy condition;

(2) Compared with the traditional Laplace mechanism, the scheme adds noise only to the "important" data and avoids wasting privacy budget. At the same level of privacy protection, less noise is added to the data, which improves data utility, so that the published data reflects the true data as closely as possible while the privacy of the data is protected.

Brief description of the drawings

Fig. 1 is a schematic diagram of the data used in the experiments provided by the invention to test the performance of the differentially private principal component analysis algorithm;

Fig. 2 is a workflow diagram of the differential privacy protection method for principal component analysis provided by the invention.

Detailed description of the embodiments

The implementation of the technical solution of the invention is described in further detail below with reference to the accompanying drawings. It should be understood that these examples only illustrate the invention and are not intended to limit its scope; after reading the invention, modifications of various equivalent forms made by those skilled in the art fall within the scope defined by the appended claims.

The invention first computes the number of principal components to retain, then maps the original data onto the principal component space to obtain the projection matrix, allocates a privacy budget to each column of the projection matrix, and computes the Laplace noise to be added to the data. This reduces the dimensionality of the data set effectively and simplifies the data, and it avoids adding noise to "unimportant" data, which reduces the waste of privacy budget and improves data utility. The differential privacy technique used by the invention defines an extremely strict attack model and provides a rigorous mathematical proof and quantitative characterization of privacy risk. At the same time, the differential privacy mechanism strikes a good balance between the utility and the security of the principal component analysis results.

Referring to Fig. 2, the specific embodiment is as follows:

Step 1: Obtain the sample data set Secom.txt, which stores the data of each attribute in a semiconductor manufacturing process. The number of samples is 1567 and the number of attributes is 591; the data set is X = {x₁, x₂, …, x₅₉₁}, where x_i is the data of the i-th attribute of all samples. Center each dimension of the data using formula (1). Ten attributes of the centered data set are shown below:

x₁ = [16.47710442, 81.32710442, -81.84289558, …, -35.64289558, -119.53289558, -69.53289558]^T

x₅₀ = [-7.93969674, -0.99239674, 5.01130326, …, -4.19689674, 7.65940326, 7.02220326]^T

x₁₀₀ = [-0.0266401, -0.0173401, 0.1202599, …, -0.0192401, 0.1435599, -0.0647401]^T

x₁₅₀ = [-2.54326790, -0.529267903, -1.99526790, …, -2.84217094e-14, 1.43873210, -2.84217094e-14]^T

x₂₀₀ = [-0.91205637, 0.11794363, -1.82205637, …, -7.61205637, -2.47205637, -2.84205637]^T

x₂₅₀ = [110.29433331, 83.37773331, -5.24676669, …, 7.68593331, -10.22116669, 12.12073331]^T

x₃₀₀ = [-0.04006684, -0.00416684, -0.00196684, …, -0.02826684, 0.02093316, -0.03726684]^T

x₃₅₀ = [2.14776410e-03, -2.25223590e-03, -4.45223590e-03, …, -3.46944695e-17, -3.46944695e-17, -3.46944695e-17]^T

x₄₀₀ = [-0.9083303, -1.9865303, -0.2702303, …, 0.3510697, -1.0224303, 2.3229697]^T

x₄₅₀ = [0.59278442, -0.23961558, -0.46731558, …, 0.38228442, 1.83908442, 1.08908442]^T

Step 2: Compute the covariance matrix A from the data matrix obtained in step 1.

Step 3: Compute the eigenvalues λ and eigenvectors V of the covariance matrix A from step 2. With the eigenvalues sorted in descending order, the first 5 eigenvalues and eigenvectors are as follows:

λ₁ = 53415197.85687523, v₁ = [-6.39070760e-04, 2.35722934e-05, 2.36801459e-04, …, 2.61329351e-08, 5.62597732e-09, 3.89298443e-04]^T

λ₂ = 21746671.90465921, v₂ = [-1.20314234e-04, -6.60163227e-04, 1.58026311e-04, …, -6.06233975e-09, 5.96647587e-09, -2.32070657e-04]^T

λ₃ = 8248376.61529074, v₃ = [1.22460363e-04, 1.71369126e-03, 3.28185512e-04, …, 1.09328336e-09, 8.83024927e-09, 7.13534990e-04]^T

λ₄ = 2073880.85929397, v₄ = [-2.72221201e-03, 2.04941860e-04, 4.20363040e-04, …, 2.66843972e-07, 5.91392106e-08, -1.42694472e-03]^T

λ₅ = 1315404.38775829, v₅ = [-1.19198101e-05, -3.62618336e-03, -2.27104930e-04, …, -3.24788891e-07, -9.39871716e-08, -3.98748600e-03]^T

Step 4: Determine the number of principal components to retain according to formula (2). Taking the eigenvalue contribution threshold α = 95%, per ≥ 95% is required, which gives k = 5.

Step 5: Compute the projection matrix Z = XV₅, where V₅ = (v₁, v₂, …, v₅) consists of the eigenvectors corresponding to the 5 retained principal components. The projection matrix Z is as follows:

Step 6: Set the random noise to be added. Let the privacy budget ε ∈ [0.1, 1]. With equal allocation, each column of the projection matrix receives the privacy budget ε_j = ε/5, and the global sensitivity of the projection matrix is denoted Δf. Let z_j be the j-th column of the projection matrix Z; the result after adding random noise is then z′_j = z_j + Lap(Δf/ε_j). The noisy projection matrix Z′ is as follows:

Step 7: Output the low-rank approximate matrix according to formula (5).

Step 8: Evaluate the performance of the algorithm. The effect of differentially private principal component analysis is evaluated using MSE-F, the error between the low-rank approximate data and the original data; the smaller the MSE-F, the higher the utility of the algorithm.

Here, the equal allocation of the privacy budget used by the invention is compared with allocation by weight, to determine which way of adding noise yields the smaller error at the same privacy budget level. Since Laplace noise is random, for each value of ε every group of experiments is run 100 times and the average MSE-F is recorded, as shown in Fig. 1.

As shown in Fig. 1, at the same privacy budget level, the equal allocation used by the invention yields a smaller error than allocation by weight, which indicates that the invention achieves higher data utility at the same level of privacy protection. In addition, the larger the privacy budget, the smaller the error.
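The experimental protocol (100 repetitions per ε, averaged MSE-F, two allocation schemes) can be sketched as below. This is an illustrative harness only: it runs on synthetic data instead of Secom.txt, the sensitivity is set to a placeholder value of 1.0, and the function name `mean_mse_f` is hypothetical, so its numbers say nothing about the results reported in Fig. 1.

```python
import numpy as np

def mean_mse_f(X, k, epsilon, scheme, sensitivity=1.0, runs=100, seed=0):
    """Average Frobenius error of the noisy rank-k reconstruction over several runs."""
    rng = np.random.default_rng(seed)
    means = X.mean(axis=0)
    Xc = X - means
    lam, V = np.linalg.eigh(Xc.T @ Xc / len(X))
    top = np.argsort(lam)[::-1][:k]               # indices of the k largest eigenvalues
    lam_k, V_k = lam[top], V[:, top]
    Z = Xc @ V_k                                  # projection matrix
    if scheme == "equal":
        budgets = np.full(k, epsilon / k)         # equal split
    else:
        budgets = epsilon * lam_k / lam_k.sum()   # allocation by eigenvalue weight
    scales = sensitivity / budgets                # placeholder sensitivity (assumption)
    errors = []
    for _ in range(runs):
        Z_noisy = Z + rng.laplace(scale=scales, size=Z.shape)
        Y = Z_noisy @ V_k.T + means
        errors.append(np.linalg.norm(Y - X, ord="fro"))
    return float(np.mean(errors))

# Synthetic data (hypothetical): 200 samples, 6 attributes with decreasing spread.
data_rng = np.random.default_rng(0)
X = data_rng.normal(size=(200, 6)) * np.array([9.0, 5.0, 3.0, 1.0, 0.5, 0.2])
for eps in (0.1, 0.5, 1.0):
    for scheme in ("equal", "weighted"):
        print(eps, scheme, round(mean_mse_f(X, 3, eps, scheme), 2))
```

Averaging over repeated runs smooths out the randomness of the Laplace draws, which is why the embodiment records the mean MSE-F for each ε.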

In conclusion, the invention proposes a differential privacy protection method for principal component analysis which, by allocating a privacy budget to each column of the projection matrix of the original data, reduces the added noise while providing privacy protection. The invention effectively avoids adding noise to "unimportant" data and reduces the waste of privacy budget, thereby improving data utility, so that the published data reflects the true data as closely as possible. It is applicable to data publication and privacy protection at different scales and in different dimensions.

The above are only some embodiments of the invention. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the invention.

Claims

1. A differential privacy protection method for principal component analysis, characterized in that it is based on a preset sample data set X, a number of samples n, and a sample space dimension d; the principal component analysis method comprises the following steps:

Step 2: Compute the covariance matrix $A = \frac{1}{n} X^{T} X$ from the data matrix obtained in step 1, where X^T is the transpose of the data matrix X;

Step 3: Compute the eigenvalues λ and eigenvectors V of the covariance matrix A from step 2, which satisfy AV = λV; sort the eigenvalues in descending order, λ₁ > λ₂ > … > λ_d, with corresponding eigenvectors v₁, v₂, …, v_d;

Step 4: Compute the number k of principal components to retain;

2. The differential privacy protection method for principal component analysis according to claim 1, characterized in that, in step 1, to simplify the computation of the covariance matrix, each attribute has its mean removed so that every dimension has mean 0 after centering, as shown in formula (1):

$$x'_j = x_j - \bar{x}_j, \qquad \bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij} \tag{1}$$

where x_j is the data of the j-th attribute of all samples, x′_j is the data of the j-th attribute of all samples after centering, x_ij is the data of the j-th attribute of the i-th sample in data set X, and $\bar{x}_j$ is the mean of the j-th attribute.

3. The differential privacy protection method for principal component analysis according to claim 1, characterized in that, in step 4, an eigenvalue contribution threshold α is set, where 0 ≤ α ≤ 1, and the number k of principal components to retain is computed so that the eigenvalue contribution ratio per of the retained principal components satisfies per ≥ α, where:

$$per = \frac{\sum_{i=1}^{k}\lambda_i}{\sum_{i=1}^{d}\lambda_i} \tag{2}$$

4. The differential privacy protection method for principal component analysis according to claim 1, characterized in that, in step 5, the projection matrix Z = XV_k is the mapping of the original data onto the principal component space, where V_k = (v₁, v₂, …, v_k) consists of the eigenvectors corresponding to the k retained principal components.

5. The differential privacy protection method for principal component analysis according to claim 1, characterized in that, in step 6, the random noise is Laplace noise, i.e., the noise follows the Laplace distribution Lap(b), where b is the scale parameter, b = Δf/ε, Δf is the global sensitivity, and ε is the privacy budget;

The j-th column of the projection matrix Z = XV_k represents the mapping of the original data onto the j-th principal component. Since each column has a different meaning, the columns may be allocated equal or unequal privacy budgets ε_j, where 1 ≤ j ≤ k.

6. The differential privacy protection method for principal component analysis according to claim 5, characterized in that equal privacy budgets ε_j are allocated by equal split: $\varepsilon_j = \frac{\varepsilon}{k}$, i.e., every column is allocated the same privacy budget;

Unequal privacy budgets ε_j are allocated by weight: $\varepsilon_j = \frac{\lambda_j}{\sum_{i=1}^{k}\lambda_i}\,\varepsilon$, i.e., the privacy budget is allocated in proportion to each principal component's share of the eigenvalues.

7. The differential privacy protection method for principal component analysis according to claim 1, characterized in that, in step 7, the noisy projection matrix is Z′ = (z′₁, z′₂, …, z′_j, …, z′_k), where z′_j is given by:

$$z'_j = z_j + Lap\left(\frac{\Delta f}{\varepsilon_j}\right)$$

8. The differential privacy protection method for principal component analysis according to claim 2, characterized in that, in step 8, the low-rank approximate matrix is $Y = Z' V_k^{T} + \bar{x}$, where $V_k^{T}$ is the transpose of the eigenvector matrix V_k and $\bar{x}$ is the mean of the attributes, with $\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$.

The error of the approximate data is computed using formula (5);

MSE-F = ||Y − X||_F (5)

Let C be an m × n matrix; then the F norm of C is:

$$\|C\|_F = \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n} |c_{ij}|^{2}}$$