CN109492428A - A differential privacy protection method for principal component analysis - Google Patents

A differential privacy protection method for principal component analysis

Info

Publication number
CN109492428A
CN109492428A CN201811265579.9A CN201811265579A CN109492428A CN 109492428 A CN109492428 A CN 109492428A CN 201811265579 A CN201811265579 A CN 201811265579A CN 109492428 A CN109492428 A CN 109492428A
Authority
CN
China
Prior art keywords
data
principal component
matrix
privacy protection
projection matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811265579.9A
Other languages
Chinese (zh)
Inventor
杨庚
徐亚红
汪伟亚
蒋辰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Post and Telecommunication University
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing Post and Telecommunication University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Post and Telecommunication University
Priority to CN201811265579.9A
Publication of CN109492428A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/21 Indexing scheme relating to G06F21/00 and subgroups addressing additional information or applications relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/2107 File encryption

Abstract

The invention discloses a differential privacy protection method for principal component analysis, comprising the following steps: center the data matrix, i.e., subtract from each dimension its mean; compute the covariance matrix $A$ of the data matrix; compute the eigenvalues $\lambda$ and eigenvectors $V$ of the covariance matrix $A$; compute the number $k$ of principal components to retain; map the original data onto the principal component space to obtain the projection matrix $Z$; allocate a privacy budget $\varepsilon_j$ to each column of the projection matrix $Z$ and compute the random noise to be added; add the noise to the projection matrix $Z$ to obtain the noised projection matrix $Z'$; compute the error between the original data and the low-rank approximate data. The invention effectively reduces the dimensionality of the data set and thereby simplifies the data, and it avoids adding noise to "unimportant" data, which reduces wasted privacy budget. This improves data availability, so that the published data reflect the true data as closely as possible while the privacy of the data is protected.

Description

A differential privacy protection method for principal component analysis
Technical field
The present invention relates to a differential privacy protection method for principal component analysis and belongs to the field of information security technology.
Background
With the continuous development of big data technology, the data stored by information systems are increasingly abundant, which increases the complexity of data analysis and processing. As one of the important methods of data analysis, principal component analysis converts many variables into a few principal variables that carry most of the information in the original data and reveal its essential structure. Principal component analysis simplifies the data, making them easier to use while reducing the computational cost of algorithms. Data sets generally contain a great deal of private information, and analyzing them directly with machine learning or data mining algorithms can leak that information. Differential privacy is currently a popular privacy protection technique. It is realized through a noise mechanism: random noise is added to the output to protect the data. The larger the added noise, the safer the data but the lower their availability, and vice versa.
For multi-attribute data, the traditional Laplace mechanism allocates the same privacy budget to every attribute. This scheme is simple and easy to implement, but the added noise is too large and data availability drops sharply. It also allocates privacy budget to "unimportant" data, wasting part of the budget, so the results are unsatisfactory.
Summary of the invention
In view of the defects described in the background, the problem to be solved by the invention is to provide a differential privacy protection method for principal component analysis. The invention effectively reduces the dimensionality of the data set and thereby simplifies the data, and it avoids adding noise to "unimportant" data, which reduces wasted privacy budget. This improves data availability, so that the published data reflect the true data as closely as possible while the privacy of the data is protected.
To solve the above problem, the following technical solution is adopted:
The differential privacy protection method for principal component analysis of the invention is based on a preset sample data set $X$ with $n$ samples and sample space dimension $d$. The method comprises the following steps:
Step 1: center the data matrix, i.e., subtract from each dimension its mean;
Step 2: compute the covariance matrix $A = \frac{1}{n}X^{T}X$ from the data matrix obtained in step 1, where $X^{T}$ is the transpose of the data matrix $X$;
Step 3: compute the eigenvalues $\lambda$ and eigenvectors $V$ of the covariance matrix $A$ from step 2, satisfying $AV = \lambda V$; sort the eigenvalues in descending order, $\lambda_1 > \lambda_2 > \ldots > \lambda_d$, with corresponding eigenvectors $v_1, v_2, \ldots, v_d$;
Step 4: compute the number $k$ of principal components to retain;
Step 5: map the original data onto the principal component space to obtain the projection matrix $Z$;
Step 6: allocate a privacy budget $\varepsilon_j$ to each column of the projection matrix $Z$ and compute the random noise to be added;
Step 7: add the noise to the projection matrix $Z$ to obtain the noised projection matrix $Z'$;
Step 8: compute the error between the original data and the low-rank approximate data.
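Steps 2 and 3 can be illustrated with a short NumPy sketch. It is only an illustration under stated assumptions: the covariance matrix is normalised by 1/n, the function name `sorted_eigendecomposition` is hypothetical, and the input is random toy data rather than the data set used in the embodiment.

```python
import numpy as np

def sorted_eigendecomposition(X_centered):
    """Covariance matrix plus its eigenvalues (descending) and matching eigenvectors."""
    n = X_centered.shape[0]
    # Assumption: covariance normalised by 1/n (not 1/(n-1)).
    A = X_centered.T @ X_centered / n
    # eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(A)
    order = np.argsort(eigvals)[::-1]          # indices for descending order
    return A, eigvals[order], eigvecs[:, order]

# Toy data: 50 samples, 4 attributes (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
Xc = X - X.mean(axis=0)                        # step 1: centering
A, lam, V = sorted_eigendecomposition(Xc)      # steps 2 and 3
```

Column `V[:, j]` is the eigenvector paired with `lam[j]`, so the relation AV = λV can be checked column by column.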
In step 1, to simplify the computation of the covariance matrix, the mean is removed from each attribute so that every dimension has mean 0 after centering, as shown in formula (1):
$$x'_j = x_j - \bar{x}_j, \qquad \bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij} \qquad (1)$$
where $x_j$ is the data of the $j$-th attribute over all samples, $x'_j$ is the data of the $j$-th attribute over all samples after centering, $x_{ij}$ is the value of the $j$-th attribute of the $i$-th sample in the data set $X$, and $\bar{x}_j$ is the mean of the $j$-th attribute.
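The centering in step 1 can be sketched in plain Python as follows. The function name `center_columns` and the toy matrix are illustrative assumptions, not part of the patent.

```python
def center_columns(X):
    """Subtract each attribute's mean from that attribute (column) of X.

    X is a list of samples (rows); every row has the same number of attributes.
    Returns the centered rows and the list of attribute means.
    """
    n = len(X)
    d = len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in X]
    return centered, means

# Toy data: 3 samples, 2 attributes (illustrative only).
X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
X_centered, means = center_columns(X)
```

After centering, every column of `X_centered` sums to zero, which is the property the covariance computation in step 2 relies on.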
In step 4, an eigenvalue contribution threshold $\alpha$ is set, where $0 \le \alpha \le 1$, and the number $k$ of principal components to retain is computed so that the eigenvalue contribution rate $per$ of the retained principal components satisfies $per \ge \alpha$, where:
$$per = \frac{\sum_{i=1}^{k}\lambda_i}{\sum_{i=1}^{d}\lambda_i} \qquad (2)$$
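A minimal sketch of this selection rule is given below. It assumes the eigenvalues are already sorted in descending order and are non-negative, and it returns the smallest k that reaches the threshold, which is an interpretation rather than something the text states explicitly. The name `choose_k` and the toy eigenvalues are hypothetical.

```python
def choose_k(eigenvalues, alpha):
    """Return (k, per): the smallest k whose cumulative eigenvalue share reaches alpha."""
    total = sum(eigenvalues)
    running = 0.0
    for k, lam in enumerate(eigenvalues, start=1):
        running += lam
        per = running / total
        if per >= alpha:
            return k, per
    # Only reached if rounding keeps the final share just below alpha.
    return len(eigenvalues), 1.0

# Toy eigenvalues in descending order (illustrative only).
k, per = choose_k([6.0, 3.0, 0.5, 0.5], alpha=0.85)
```

With these toy values the first component covers 60% of the total and the first two cover 90%, so two components are kept.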
In step 5, the projection matrix $Z = XV_k$ is the mapping of the original data onto the principal component space, where $V_k = (v_1, v_2, \ldots, v_k)$ consists of the eigenvectors corresponding to the $k$ retained principal components.
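The projection in step 5 is a single matrix product. The sketch below assumes the eigenvectors are stored as columns of `V` in descending eigenvalue order; the function name `project` and the hand-picked orthonormal basis are illustrative assumptions.

```python
import numpy as np

def project(X_centered, V, k):
    """Map centered data onto the first k principal directions: Z = X V_k."""
    V_k = V[:, :k]              # first k eigenvectors, one per column
    return X_centered @ V_k

# Toy example: points on the line y = x, with a hand-picked orthonormal basis.
X_centered = np.array([[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0]])
V = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)
Z = project(X_centered, V, k=1)
```

Here `Z` has one column because only the first direction is kept; each entry is the coordinate of a sample along that direction.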
In step 6, the random noise is Laplace noise, i.e., the noise follows the Laplace distribution $Lap(b)$, where $b$ is the scale parameter, $b = \Delta f / \varepsilon$, $\Delta f$ is the global sensitivity, and $\varepsilon$ is the privacy budget.
The probability density function of the Laplace distribution with scale parameter $b$ is:
$$p(x) = \frac{1}{2b}\exp\left(-\frac{|x|}{b}\right)$$
where $x$ ranges over all possible values and $p(x)$ is the probability density at $x$.
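The Laplace distribution with scale b can be sketched with the standard library alone. The sampler below uses the fact that the difference of two independent exponential variables with mean b is Laplace-distributed with scale b; this is one possible sampling technique, not necessarily the one used by the inventors. The function names and the example values of Δf and ε are assumptions.

```python
import math
import random

def laplace_pdf(x, b):
    """Density of the zero-mean Laplace distribution with scale b."""
    return math.exp(-abs(x) / b) / (2.0 * b)

def sample_laplace(b, rng):
    """Draw one Lap(b) sample as the difference of two exponentials with mean b."""
    return rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)

# Example: scale b = sensitivity / epsilon (both values are placeholders).
sensitivity = 1.0
epsilon = 0.5
b = sensitivity / epsilon
rng = random.Random(0)
noise = [sample_laplace(b, rng) for _ in range(5)]
```

A smaller privacy budget ε gives a larger scale b and therefore noise that is more spread out, which matches the trade-off between protection and availability described in the background.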
Column $j$ of the projection matrix $Z = XV_k$ is the mapping of the original data onto the $j$-th principal component. Because each column has a different meaning, the columns can be allocated equal or unequal privacy budgets $\varepsilon_j$, where $1 \le j \le k$.
Equal privacy budgets $\varepsilon_j$ (equal split): $\varepsilon_j = \varepsilon / k$, so that every column receives the same privacy budget;
Unequal privacy budgets $\varepsilon_j$ (split by weight): $\varepsilon_j = \varepsilon \cdot \lambda_j / \sum_{i=1}^{k}\lambda_i$, so that the privacy budget is allocated according to each principal component's share of the eigenvalues.
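The two allocation schemes can be sketched as follows. The weighted scheme here normalises by the sum of the k retained eigenvalues so that the per-column budgets add up to ε; that normalisation is an assumption. The function names are hypothetical, and the eigenvalues are the five leading values listed in the embodiment.

```python
def equal_budgets(epsilon, k):
    """Equal split: every column receives epsilon / k."""
    return [epsilon / k] * k

def weighted_budgets(epsilon, eigenvalues_k):
    """Weighted split: column j receives a share proportional to its eigenvalue."""
    total = sum(eigenvalues_k)
    return [epsilon * lam / total for lam in eigenvalues_k]

# The five leading eigenvalues reported in the embodiment.
lams = [53415197.85687523, 21746671.90465921, 8248376.61529074,
        2073880.85929397, 1315404.38775829]
eq = equal_budgets(1.0, len(lams))
wt = weighted_budgets(1.0, lams)
```

Under the weighted split the first column receives the largest budget and therefore the least noise, while the last column receives a small budget and correspondingly large noise.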
In step 7, the noised projection matrix is $Z' = (z'_1, z'_2, \ldots, z'_j, \ldots, z'_k)$, where $z'_j$ is given by:
$$z'_j = z_j + Lap\left(\frac{\Delta f}{\varepsilon_j}\right)$$
where $z_j$ is column $j$ of the projection matrix and $\Delta f$ is the global sensitivity of the projection matrix.
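Adding column-wise noise can be sketched with NumPy's `Generator.laplace`. The sensitivity is left as a parameter because its value is not recoverable from the text; the value 1.0 below is a placeholder, and the function name `add_laplace_noise` is hypothetical.

```python
import numpy as np

def add_laplace_noise(Z, budgets, sensitivity, rng):
    """Return a copy of Z where column j has Lap(sensitivity / budgets[j]) noise added."""
    Z = np.asarray(Z, dtype=float)
    Z_noisy = Z.copy()
    for j, eps_j in enumerate(budgets):
        scale = sensitivity / eps_j            # Laplace scale for column j
        Z_noisy[:, j] += rng.laplace(loc=0.0, scale=scale, size=Z.shape[0])
    return Z_noisy

# Toy projection matrix with k = 2 columns (illustrative only).
Z = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
rng = np.random.default_rng(0)
Z_noisy = add_laplace_noise(Z, budgets=[0.5, 0.5], sensitivity=1.0, rng=rng)
```

The input matrix is left unchanged, so the noised and original projections can be compared afterwards.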
In step 8, the low-rank approximate matrix is $Y = Z'V_k^{T} + \bar{x}$, where $V_k^{T}$ is the transpose of the eigenvector matrix $V_k$ and $\bar{x} = (\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_d)$ is the mean of the attributes.
The approximation error is computed using formula (5):
$$\text{MSE-F} = \|Y - X\|_F \qquad (5)$$
$\|\cdot\|_F$ is the F norm (Frobenius norm) of a matrix, i.e., the square root of the sum of the squares of the matrix elements.
Let $C$ be an $m \times n$ matrix; then the F norm of $C$ is:
$$\|C\|_F = \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n} |c_{ij}|^{2}}$$
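The reconstruction and the error measure can be sketched as below. The sketch assumes that the low-rank approximation is obtained by mapping the projection back through the transposed eigenvectors and adding the attribute means, and it compares the result with the uncentered data. The function names and the toy matrix are hypothetical.

```python
import numpy as np

def low_rank_approximation(Z_noisy, V_k, means):
    """Map the (noised) projection back to the original attribute space."""
    return Z_noisy @ V_k.T + means

def mse_f(Y, X):
    """Frobenius norm of Y - X: square root of the sum of squared entries."""
    return float(np.sqrt(np.sum((Y - X) ** 2)))

# Sanity check on toy data: with all components kept and no noise, Y equals X.
X = np.array([[1.0, 2.0], [3.0, 5.0], [5.0, 5.0]])
means = X.mean(axis=0)
Xc = X - means
_, V = np.linalg.eigh(Xc.T @ Xc / X.shape[0])
Z = Xc @ V
Y = low_rank_approximation(Z, V, means)
error = mse_f(Y, X)
```

Because the eigenvector matrix is orthogonal, the error in this noise-free, full-rank case is zero up to floating-point rounding; dropping components or adding noise makes it positive.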
Compared with the prior art, the invention, by adopting the above technical solution, has the following technical effects:
To address the excessive noise added by the traditional Laplace mechanism, the invention proposes a better way of adding noise, so that the recovered low-rank approximate data are distorted to a certain degree, which achieves privacy protection while preserving data availability. The method is simple, easy to operate, and places no restriction on the size or attributes of the data set. Its features are as follows:
(1) To guarantee the security of the principal component analysis algorithm, a principal component analysis algorithm for differential privacy protection is designed by adding appropriate noise to the projection matrix, and the algorithm is proved to satisfy differential privacy;
(2) Compared with the traditional Laplace mechanism, the scheme adds noise only to the "important" data and avoids wasting privacy budget. For the same degree of privacy protection, less noise is added to the data, which improves data availability, so that the published data reflect the true data as closely as possible while the privacy of the data is protected.
Brief description of the drawings
Fig. 1 is a schematic diagram of the data used in the experiment provided by the invention to test the performance of the differentially private principal component analysis algorithm;
Fig. 2 is the flow chart of the differential privacy protection method for principal component analysis provided by the invention.
Detailed description of the embodiments
The implementation of the technical solution of the invention is described in further detail below with reference to the accompanying drawings. It should be understood that these examples only illustrate the invention and do not limit its scope; after reading the invention, modifications of various equivalent forms made by those skilled in the art fall within the scope defined by the appended claims of this application.
The invention first computes the number of principal components to retain, then maps the original data onto the principal component space to obtain the projection matrix, allocates a privacy budget to each column of the projection matrix, and computes the Laplace noise to add to the data. This effectively reduces the dimensionality of the data set and simplifies the data, and it avoids adding noise to "unimportant" data, which reduces wasted privacy budget and improves data availability. The differential privacy technique used by the invention defines a very strict attack model and gives a rigorous mathematical proof and quantitative expression of privacy risk; the differential privacy mechanism also achieves a good balance between the availability and the security of the principal component analysis results.
Referring to Fig. 2, the specific embodiment is as follows:
Step 1: Collect the sample data set Secom.txt, which stores the data of each attribute in a semiconductor manufacturing process, with 1567 samples and 591 attributes. The data set is $X = \{x_1, x_2, \ldots, x_{591}\}$, where $x_i$ is the data of the $i$-th attribute over all samples. Center each dimension of the data using formula (1). Ten attributes of the centered data set are shown below:
$x_{1} = [16.47710442, 81.32710442, -81.84289558, \ldots, -35.64289558, -119.53289558, -69.53289558]^{T}$
$x_{50} = [-7.93969674, -0.99239674, 5.01130326, \ldots, -4.19689674, 7.65940326, 7.02220326]^{T}$
$x_{100} = [-0.0266401, -0.0173401, 0.1202599, \ldots, -0.0192401, 0.1435599, -0.0647401]^{T}$
$x_{150} = [-2.54326790, -0.529267903, -1.99526790, \ldots, -2.84217094e-14, 1.43873210, -2.84217094e-14]^{T}$
$x_{200} = [-0.91205637, 0.11794363, -1.82205637, \ldots, -7.61205637, -2.47205637, -2.84205637]^{T}$
$x_{250} = [110.29433331, 83.37773331, -5.24676669, \ldots, 7.68593331, -10.22116669, 12.12073331]^{T}$
$x_{300} = [-0.04006684, -0.00416684, -0.00196684, \ldots, -0.02826684, 0.02093316, -0.03726684]^{T}$
$x_{350} = [2.14776410e-03, -2.25223590e-03, -4.45223590e-03, \ldots, -3.46944695e-17, -3.46944695e-17, -3.46944695e-17]^{T}$
$x_{400} = [-0.9083303, -1.9865303, -0.2702303, \ldots, 0.3510697, -1.0224303, 2.3229697]^{T}$
$x_{450} = [0.59278442, -0.23961558, -0.46731558, \ldots, 0.38228442, 1.83908442, 1.08908442]^{T}$
Step 2: Compute the covariance matrix $A$ from the data matrix obtained in step 1.
Step 3: Compute the eigenvalues $\lambda$ and eigenvectors $V$ of the covariance matrix $A$ from step 2. With the eigenvalues sorted in descending order, the first 5 eigenvalues and eigenvectors are as follows:
$\lambda_{1} = 53415197.85687523$, $v_{1} = [-6.39070760e-04, 2.35722934e-05, 2.36801459e-04, \ldots, 2.61329351e-08, 5.62597732e-09, 3.89298443e-04]^{T}$
$\lambda_{2} = 21746671.90465921$, $v_{2} = [-1.20314234e-04, -6.60163227e-04, 1.58026311e-04, \ldots, -6.06233975e-09, 5.96647587e-09, -2.32070657e-04]^{T}$
$\lambda_{3} = 8248376.61529074$, $v_{3} = [1.22460363e-04, 1.71369126e-03, 3.28185512e-04, \ldots, 1.09328336e-09, 8.83024927e-09, 7.13534990e-04]^{T}$
$\lambda_{4} = 2073880.85929397$, $v_{4} = [-2.72221201e-03, 2.04941860e-04, 4.20363040e-04, \ldots, 2.66843972e-07, 5.91392106e-08, -1.42694472e-03]^{T}$
$\lambda_{5} = 1315404.38775829$, $v_{5} = [-1.19198101e-05, -3.62618336e-03, -2.27104930e-04, \ldots, -3.24788891e-07, -9.39871716e-08, -3.98748600e-03]^{T}$
Step 4: Determine the number of principal components to retain according to formula (2). Taking the eigenvalue contribution threshold $\alpha = 95\%$ requires $per \ge 95\%$, which gives $k = 5$.
Step 5: Compute the projection matrix $Z = XV_5$, where $V_5 = (v_1, v_2, \ldots, v_5)$ consists of the eigenvectors corresponding to the 5 retained principal components. The projection matrix $Z$ is as follows:
Step 6: Set the random noise to be added. Let the privacy budget be $\varepsilon \in [0.1, 1]$. With equal allocation, each column of the projection matrix receives the privacy budget $\varepsilon_j = \varepsilon / 5$, and the global sensitivity is $\Delta f$. Let $z_j$ denote column $j$ of the projection matrix $Z$; the result after adding random noise is then $z'_j = z_j + Lap(\Delta f / \varepsilon_j)$. The noised projection matrix $Z'$ is as follows:
Step 7: Output the low-rank approximate matrix according to formula (5).
Step 8: Evaluate the performance of the algorithm. The differentially private principal component analysis is evaluated using MSE-F, the error between the low-rank approximate data and the original data; the smaller MSE-F is, the higher the availability of the algorithm's output.
Here, the equal allocation of privacy budget used by the invention is compared with allocation by weight, to see which noise-adding scheme yields the smaller error at the same privacy budget level. Because Laplace noise is random, each experiment is repeated 100 times for every value of $\varepsilon$ and the average MSE-F is recorded, as shown in Fig. 1.
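The repeated-trial comparison described here can be sketched end to end in NumPy. Everything specific in the sketch is an assumption made for illustration: synthetic Gaussian data stands in for the Secom data set, the covariance matrix is normalised by 1/n, the sensitivity is a placeholder value of 1.0, and the weighted split normalises by the sum of the retained eigenvalues. It shows the structure of the experiment only and makes no claim about which scheme gives the smaller error on real data.

```python
import numpy as np

def dp_pca_error(X, k, epsilon, sensitivity, weighted, rng):
    """One trial: noised PCA projection, reconstruction, and the MSE-F error."""
    means = X.mean(axis=0)
    Xc = X - means
    A = Xc.T @ Xc / X.shape[0]                 # assumption: 1/n normalisation
    lam, V = np.linalg.eigh(A)
    order = np.argsort(lam)[::-1][:k]          # indices of the k largest eigenvalues
    lam_k, V_k = lam[order], V[:, order]
    Z = Xc @ V_k
    if weighted:
        budgets = epsilon * lam_k / lam_k.sum()
    else:
        budgets = np.full(k, epsilon / k)
    # The scale array broadcasts across rows, giving one scale per column.
    noise = rng.laplace(0.0, sensitivity / budgets, size=Z.shape)
    Y = (Z + noise) @ V_k.T + means
    return np.linalg.norm(Y - X, ord="fro")

def average_errors(X, k, epsilons, sensitivity, runs, seed=0):
    """Average MSE-F over repeated trials for each budget and allocation scheme."""
    rng = np.random.default_rng(seed)
    table = {}
    for eps in epsilons:
        for weighted in (False, True):
            errs = [dp_pca_error(X, k, eps, sensitivity, weighted, rng)
                    for _ in range(runs)]
            table[(eps, weighted)] = float(np.mean(errs))
    return table

# Synthetic stand-in data: 200 samples, 6 attributes with decreasing spread.
data_rng = np.random.default_rng(1)
X_demo = data_rng.normal(size=(200, 6)) * np.array([5.0, 3.0, 1.0, 0.5, 0.2, 0.1])
errors = average_errors(X_demo, k=2, epsilons=[0.1, 0.5, 1.0],
                        sensitivity=1.0, runs=100)
```

The keys of `errors` are `(epsilon, weighted)` pairs, so `errors[(0.5, False)]` is the average error for the equal split at ε = 0.5 and `errors[(0.5, True)]` is the corresponding value for the weighted split.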
As Fig. 1 shows, at the same privacy budget level the equal allocation used by the invention yields a smaller error than allocation by weight, which indicates that the invention provides higher data availability at the same privacy protection level; moreover, the larger the privacy budget, the smaller the error.
In conclusion, the invention proposes a differential privacy protection method for principal component analysis that allocates a privacy budget to each column of the projection matrix of the original data, reducing the added noise while still providing privacy protection. The invention effectively avoids adding noise to "unimportant" data and reduces wasted privacy budget, which improves data availability so that the published data reflect the true data as closely as possible. It is applicable to data publishing and privacy protection for data of different scales and dimensions.
The above are only some embodiments of the invention. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the invention.

Claims (8)

1. A differential privacy protection method for principal component analysis, characterized in that it is based on a preset sample data set $X$ with $n$ samples and sample space dimension $d$, and that the method comprises the following steps:
Step 1: center the data matrix, i.e., subtract from each dimension its mean;
Step 2: compute the covariance matrix $A = \frac{1}{n}X^{T}X$ from the data matrix obtained in step 1, where $X^{T}$ is the transpose of the data matrix $X$;
Step 3: compute the eigenvalues $\lambda$ and eigenvectors $V$ of the covariance matrix $A$ from step 2, satisfying $AV = \lambda V$; sort the eigenvalues in descending order, $\lambda_1 > \lambda_2 > \ldots > \lambda_d$, with corresponding eigenvectors $v_1, v_2, \ldots, v_d$;
Step 4: compute the number $k$ of principal components to retain;
Step 5: map the original data onto the principal component space to obtain the projection matrix $Z$;
Step 6: allocate a privacy budget $\varepsilon_j$ to each column of the projection matrix $Z$ and compute the random noise to be added;
Step 7: add the noise to the projection matrix $Z$ to obtain the noised projection matrix $Z'$;
Step 8: compute the error between the original data and the low-rank approximate data.
2. The differential privacy protection method for principal component analysis according to claim 1, characterized in that in step 1, to simplify the computation of the covariance matrix, the mean is removed from each attribute so that every dimension has mean 0 after centering, as shown in formula (1):
$$x'_j = x_j - \bar{x}_j, \qquad \bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij} \qquad (1)$$
where $x_j$ is the data of the $j$-th attribute over all samples, $x'_j$ is the data of the $j$-th attribute over all samples after centering, $x_{ij}$ is the value of the $j$-th attribute of the $i$-th sample in the data set $X$, and $\bar{x}_j$ is the mean of the $j$-th attribute.
3. The differential privacy protection method for principal component analysis according to claim 1, characterized in that in step 4, an eigenvalue contribution threshold $\alpha$ is set, where $0 \le \alpha \le 1$, and the number $k$ of principal components to retain is computed so that the eigenvalue contribution rate $per$ of the retained principal components satisfies $per \ge \alpha$, where:
$$per = \frac{\sum_{i=1}^{k}\lambda_i}{\sum_{i=1}^{d}\lambda_i} \qquad (2)$$
4. The differential privacy protection method for principal component analysis according to claim 1, characterized in that in step 5, the projection matrix $Z = XV_k$ is the mapping of the original data onto the principal component space, where $V_k = (v_1, v_2, \ldots, v_k)$ consists of the eigenvectors corresponding to the $k$ retained principal components.
5. The differential privacy protection method for principal component analysis according to claim 1, characterized in that in step 6, the random noise is Laplace noise, i.e., the noise follows the Laplace distribution $Lap(b)$, where $b$ is the scale parameter, $b = \Delta f / \varepsilon$, $\Delta f$ is the global sensitivity, and $\varepsilon$ is the privacy budget;
the probability density function of the Laplace distribution with scale parameter $b$ is:
$$p(x) = \frac{1}{2b}\exp\left(-\frac{|x|}{b}\right)$$
where $x$ ranges over all possible values and $p(x)$ is the probability density at $x$;
column $j$ of the projection matrix $Z = XV_k$ is the mapping of the original data onto the $j$-th principal component; because each column has a different meaning, the columns can be allocated equal or unequal privacy budgets $\varepsilon_j$, where $1 \le j \le k$.
6. The differential privacy protection method for principal component analysis according to claim 5, characterized in that equal privacy budgets $\varepsilon_j$ are allocated by equal split: $\varepsilon_j = \varepsilon / k$, so that every column receives the same privacy budget;
unequal privacy budgets $\varepsilon_j$ are allocated by weight: $\varepsilon_j = \varepsilon \cdot \lambda_j / \sum_{i=1}^{k}\lambda_i$, so that the privacy budget is allocated according to each principal component's share of the eigenvalues.
7. The differential privacy protection method for principal component analysis according to claim 1, characterized in that in step 7, the noised projection matrix is $Z' = (z'_1, z'_2, \ldots, z'_j, \ldots, z'_k)$, where $z'_j$ is given by:
$$z'_j = z_j + Lap\left(\frac{\Delta f}{\varepsilon_j}\right)$$
where $z_j$ is column $j$ of the projection matrix and $\Delta f$ is the global sensitivity of the projection matrix.
8. The differential privacy protection method for principal component analysis according to claim 2, characterized in that in step 8, the low-rank approximate matrix is $Y = Z'V_k^{T} + \bar{x}$, where $V_k^{T}$ is the transpose of the eigenvector matrix $V_k$ and $\bar{x} = (\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_d)$ is the mean of the attributes;
the approximation error is computed using formula (5):
$$\text{MSE-F} = \|Y - X\|_F \qquad (5)$$
$\|\cdot\|_F$ is the F norm (Frobenius norm) of a matrix, i.e., the square root of the sum of the squares of the matrix elements;
let $C$ be an $m \times n$ matrix; then the F norm of $C$ is:
$$\|C\|_F = \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n} |c_{ij}|^{2}}$$
CN201811265579.9A 2018-10-29 2018-10-29 A differential privacy protection method for principal component analysis Pending CN109492428A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811265579.9A CN109492428A (en) 2018-10-29 2018-10-29 A differential privacy protection method for principal component analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811265579.9A CN109492428A (en) 2018-10-29 2018-10-29 A differential privacy protection method for principal component analysis

Publications (1)

Publication Number Publication Date
CN109492428A (en) 2019-03-19

Family

ID=65691791

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811265579.9A Pending CN109492428A (en) 2018-10-29 2018-10-29 A differential privacy protection method for principal component analysis

Country Status (1)

Country Link
CN (1) CN109492428A (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107451954A (en) * 2017-05-23 2017-12-08 南京邮电大学 Iterated pixel interpolation method based on image low-rank property

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
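Qi Mingyu (戚名钰) et al.: "Differential privacy data publishing algorithm using component analysis" (采用成分分析的差分隐私数据发布算法), Journal of Chinese Computer Systems (《小型微型计算机系统》) *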

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110334546A (en) * 2019-07-08 2019-10-15 辽宁工业大学 Difference privacy high dimensional data based on principal component analysis optimization issues guard method
CN110334546B (en) * 2019-07-08 2021-11-23 辽宁工业大学 Difference privacy high-dimensional data release protection method based on principal component analysis optimization
CN111241582A (en) * 2020-01-10 2020-06-05 鹏城实验室 Data privacy protection method and device and computer readable storage medium
CN111241582B (en) * 2020-01-10 2022-06-10 鹏城实验室 Data privacy protection method and device and computer readable storage medium
CN112560094A (en) * 2020-12-18 2021-03-26 湖南大学 Dual optimization-based high-availability graph data privacy protection method
CN114710259A (en) * 2022-03-22 2022-07-05 中南大学 Multi-party combined safety PCA projection method and data correlation analysis method
CN114710259B (en) * 2022-03-22 2024-04-19 中南大学 Multi-party combined safety PCA projection method and data correlation analysis method
CN116761164A (en) * 2023-08-11 2023-09-15 北京科技大学 Privacy data transmission method and system based on matrix completion
CN116761164B (en) * 2023-08-11 2023-11-14 北京科技大学 Privacy data transmission method and system based on matrix completion

Similar Documents

Publication Publication Date Title
CN109492428A (en) A kind of difference method for secret protection towards principal component analysis
Naresh Kumar et al. A new methodology for estimating internal credit risk and bankruptcy prediction under Basel II Regime
Johnstone Approximate null distribution of the largest root in multivariate analysis
Dos Santos et al. A canonical correlation analysis of the relationship between sustainability and competitiveness
Zhu et al. Operational risk measurement: a loss distribution approach with segmented dependence
Stoklosa et al. Fast forward selection for generalized estimating equations with a large number of predictor variables
CN111007018B (en) Background estimation method and system for spectrum gas detection
CN115601182A (en) Data analysis method, pricing method and related equipment based on improved XGboost method
Chen et al. Adaptive structural reliability analysis method based on confidence interval squeezing
Safdari et al. The relationship between military expenditure and economic growth in four Asian countries
Nkurunziza et al. Estimation strategies for the regression coefficient parameter matrix in multivariate multiple regression
Amati et al. Survival analysis for freshness in microblogging search
Poskitt On singular spectrum analysis and stepwise time series reconstruction
Sabourin et al. Combining regional estimation and historical floods: A multivariate semiparametric peaks‐over‐threshold model with censored data
Turkay et al. Correlation stress testing for value-at-risk
Tong et al. Variable selection for panel count data via non‐concave penalized estimating function
Xu et al. Lower bound approximation to basket option values for local volatility jump-diffusion models
Cardozo et al. Generalized log-gamma additive partial linear models with P-spline smoothing
Genest et al. Basel 2 IRB Risk Weight Functions: Demonstration & Analysis
Bloch Fast calibration of the Affine and Quadratic models
Landsman et al. Efficient analysis of case‐control studies with sample weights
Ayusuk et al. Copula based volatility models and extreme value theory for portfolio simulation with an application to asian stock markets
Hu et al. Utility‐based shortfall risk: Efficient computations via Monte Carlo
Kim Swaption pricing in affine and other models
Oh et al. Age‐Period‐Cohort approaches to back‐calculation of cancer incidence rate

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190319

RJ01 Rejection of invention patent application after publication