Disclosure of Invention
In view of the above, the present specification provides a differential privacy protection method for sensitive data, where the sensitive data is a matrix X of dimensions n × d, X is decomposable into the product of a matrix P of dimensions n × k and a matrix Q of dimensions k × d, and n, k and d are natural numbers, the method including:
according to the value ranges of X, P and Q, determining a supremum B not less than the maximum value of |log p(x_i | P, Q)|; x_i is the ith row of the matrix X, i is a natural number from 1 to n, and p(x_i | P, Q) is the likelihood function of x_i given P and Q;
sampling from the posterior distribution proportional to exp((ε/4B) · Σ_{i=1..n} log p(x_i | P, Q)) · π(P, Q), where the P and Q obtained by sampling are output data satisfying ε-differential privacy, and π(P, Q) is a prior distribution over P and Q.
The data mining method with differential privacy protection provided by the specification comprises the following steps:
acquiring a matrix P of dimensions n × k, where the matrix P is obtained by sampling from the posterior distribution proportional to exp((ε/4B) · Σ_{i=1..n} log p(x_i | P, Q)) · π(P, Q); X is a matrix of dimensions n × d decomposable into the product of the matrix P and a matrix Q of dimensions k × d; x_i is the ith row of the matrix X; p(x_i | P, Q) is the likelihood function of x_i given P and Q; π(P, Q) is a prior distribution over P and Q; B is determined according to the value ranges of X, P and Q and is a supremum not less than the maximum value of |log p(x_i | P, Q)|; n, k and d are natural numbers, i is a natural number from 1 to n, and ε is a differential privacy protection parameter;
and generating training samples using the matrix P, and training a data mining model.
The present specification also provides an apparatus for differential privacy protection of sensitive data, where the sensitive data is a matrix X of dimensions n × d, X is decomposable into the product of a matrix P of dimensions n × k and a matrix Q of dimensions k × d, and n, k and d are natural numbers, the apparatus including:
a supremum determining unit for determining, according to the value ranges of X, P and Q, a supremum B not less than the maximum value of |log p(x_i | P, Q)|; x_i is the ith row of the matrix X, i is a natural number from 1 to n, and p(x_i | P, Q) is the likelihood function of x_i given P and Q;
a posterior sampling unit for sampling from the posterior distribution proportional to exp((ε/4B) · Σ_{i=1..n} log p(x_i | P, Q)) · π(P, Q), where the P and Q obtained by sampling are output data satisfying ε-differential privacy, and π(P, Q) is a prior distribution over P and Q.
The present specification further provides a data mining apparatus with differential privacy protection, including:
a protected data acquisition unit for acquiring a matrix P of dimensions n × k, where the matrix P is obtained by sampling from the posterior distribution proportional to exp((ε/4B) · Σ_{i=1..n} log p(x_i | P, Q)) · π(P, Q); X is a matrix of dimensions n × d decomposable into the product of the matrix P and a matrix Q of dimensions k × d; x_i is the ith row of the matrix X; p(x_i | P, Q) is the likelihood function of x_i given P and Q; π(P, Q) is a prior distribution over P and Q; B is determined according to the value ranges of X, P and Q and is a supremum not less than the maximum value of |log p(x_i | P, Q)|; n, k and d are natural numbers, i is a natural number from 1 to n, and ε is a differential privacy protection parameter;
and a model training unit for generating training samples using the matrix P and training a data mining model.
This specification provides a computer device comprising a memory and a processor, where the memory stores a computer program executable by the processor, and when the processor runs the computer program, the steps of the above differential privacy protection method for sensitive data are performed.
This specification provides a computer device comprising a memory and a processor, where the memory stores a computer program executable by the processor, and when the processor runs the computer program, the steps of the above data mining method with differential privacy protection are performed.
The present specification provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the steps of the above differential privacy protection method for sensitive data are performed.
The present specification also provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the steps of the above data mining method with differential privacy protection are performed.
As can be seen from the above technical solutions, in the embodiments of the present specification, the sensitive data matrix X is decomposed into a matrix P and a matrix Q, a supremum B not less than the maximum value of |log p(x_i | P, Q)| is determined, and, using the property that statistics obtained by sampling from the posterior distribution of (P, Q) satisfy 4B-differential privacy, P and Q are sampled from the posterior distribution proportional to exp((ε/4B) · Σ_{i=1..n} log p(x_i | P, Q)) · π(P, Q). A matrix decomposition result satisfying ε-differential privacy can thus be obtained without matrix inversion, so that large-scale private data can be protected quickly and efficiently; moreover, the method is applicable to a matrix X with an arbitrary distribution and can be widely applied in various scenarios.
Detailed Description
Matrix Factorization (MF), which decomposes an original matrix into the product of several matrices, is a dimension reduction technique. It achieves the purpose of compressing, representing and approximating the original matrix by finding effective low-dimensional features in it. The low-rank matrices resulting from matrix factorization thus retain most of the information content of the original matrix while being different from it, so the technique can be used to process sensitive data.
Assuming that the sensitive data has n (n is a natural number) records, and each record has d (d is a natural number) data items, the sensitive data can be expressed as a matrix X with n X d dimensions. Matrix decomposition is performed on X, that is, an n × k (k is a natural number) dimensional matrix P and a k × d dimensional matrix Q satisfying formula 1 are found:
X = PQ    (Formula 1)
The low-rank matrix P or Q is an approximation of the sensitive data (i.e., the matrix X) and retains a large amount of its information. If the low-rank matrix P or Q is to be used as privacy-protected data, it must additionally satisfy the condition that the sensitive data matrix is difficult to derive from it in reverse; a matrix P or Q satisfying the differential privacy requirement meets this condition.
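As a purely illustrative sketch (not part of the claimed method), one standard way to obtain a rank-k factorization X ≈ PQ as in Formula 1 is a truncated singular value decomposition; the matrix sizes and data below are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 6, 4, 2
X = rng.random((n, d))                 # toy stand-in for the n x d sensitive matrix

# Truncated SVD: keep the k largest singular values/vectors
U, s, Vt = np.linalg.svd(X, full_matrices=False)
P = U[:, :k] * s[:k]                   # n x k low-rank factor
Q = Vt[:k, :]                          # k x d low-rank factor

# Relative Frobenius error of the rank-k approximation P @ Q
approx_err = np.linalg.norm(X - P @ Q) / np.linalg.norm(X)
```

Note that this plain SVD factorization offers no privacy by itself; the differential privacy guarantee in this specification comes from the posterior sampling described below.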
Embodiments of the present specification provide a new differential privacy protection method for sensitive data and a new data mining method with differential privacy protection. Value ranges are set for P and Q, a value not less than the maximum of |log p(x_i | P, Q)| is taken as a supremum B, and, based on the property that statistics sampled from the (P, Q) posterior distribution satisfy 4B-differential privacy, P and Q satisfying ε-differential privacy are obtained by sampling from the posterior distribution proportional to exp((ε/4B) · Σ_{i=1..n} log p(x_i | P, Q)) · π(P, Q). The embodiments do not involve matrix inversion when sampling from the posterior distribution, can obtain the decomposed matrices quickly and efficiently, are suitable for processing large-scale or online data, and impose no requirement on the distribution function of the sensitive data.
Embodiments of the present description may be implemented on any device with computing and storage capabilities, such as a mobile phone, a tablet Computer, a PC (Personal Computer), a notebook, a server, and so on; the functions in the embodiments of the present specification may also be implemented by a logical node operating in two or more devices.
In the embodiment of the present specification, a flow of a differential privacy protection method for sensitive data is shown in fig. 1.
Step 110, according to the value ranges of X, P and Q, determining a supremum B not less than the maximum value of |log p(x_i | P, Q)|.
In the embodiments of the present specification, the matrices X, P and Q each have a value range. That is, for the matrix X there are two real numbers max_X and min_X satisfying Formula 2:
min_X ≤ x_{i,j} ≤ max_X    (Formula 2)
In Formula 2, x_{i,j} is the element in row i and column j of the matrix X, i is a natural number from 1 to n, and j is a natural number from 1 to d. Similarly, there are two such real numbers for each of the matrices P and Q.
In practical application scenarios, most raw data from a business system has a value range; even individual data items without a value range can be converted into data items with one through simple processing, without affecting the meaning they represent. For example, a data item without a value range can be mapped to one with a value range through a suitable mathematical function; alternatively, such a data item can be divided into several levels and represented by its level value, thereby defining a value range.
In one implementation, the raw data may be represented as an n × d matrix X_p, and the matrix X obtained by normalizing X_p by columns is taken as the sensitive data matrix. Thus, x_{i,j} has the value range [0, 1].
For the matrices P and Q, the value ranges of their elements can be preset. Because the information contained in the matrix X is embodied in the relative values of the elements in P and Q, limiting their value ranges does not affect how closely the information carried in P and Q approximates the information in X.
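The column-wise normalization mentioned above can be sketched as follows; the function name and the min-max scheme are illustrative assumptions, since the specification does not fix a particular normalization.

```python
import numpy as np

def normalize_columns(Xp):
    """Min-max normalize each column of the raw data matrix X_p to [0, 1]."""
    lo = Xp.min(axis=0)
    hi = Xp.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero for constant columns
    return (Xp - lo) / span

# Toy raw data: 3 records with 2 data items each
Xp = np.array([[1.0, 50.0],
               [3.0, 150.0],
               [2.0, 100.0]])
X = normalize_columns(Xp)               # every x_ij now lies in [0, 1]
```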
p(x_i | P, Q) is the likelihood function of x_i given P and Q, where x_i is the ith row of the matrix X; it indicates the likelihood that x_i is observed given P and Q. log p(x_i | P, Q) is the logarithmic form of the likelihood function.
In the embodiments of the present specification, the matrix X obeys a certain distribution, and the form of its distribution function can be predetermined according to factors such as the requirements of the actual application scenario and the characteristics of the sensitive data; the parameters of the distribution function of X can be determined from the sensitive data according to the prior art and are not described in detail.
For a matrix X that follows any distribution, once the form and parameters of its distribution function have been determined, log p(x_i | P, Q) is determined by the values of the elements in x_i, P and Q together with the form and parameters of the distribution function of X. Thus, the value range of log p(x_i | P, Q) can be obtained from the value ranges of the elements in x_i, P and Q and the parameters of the distribution function of X, thereby determining the range of values of |log p(x_i | P, Q)|.
How to determine the value range of log p(x_i | P, Q) is described below, taking an X that follows a normal distribution as an example. Assuming that the standard deviation of the normal distribution of X is σ, Formula 3 holds:
log p(x_i | P, Q) = −(x_i − p_i·Q)(x_i^T − Q^T·p_i^T) / (2σ²) + Const = −Σ_{j=1..d} (x_{i,j} − p_i·q_j)² / (2σ²) + Const    (Formula 3)
In Formula 3, Q^T is the transposed matrix of Q, p_i is row i of P, q_j is column j of Q, and Const is determined by the mean and standard deviation σ of the normal distribution of X and is a constant independent of the values of the elements in x_i, P and Q.
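Formula 3 can be sketched in code as follows; the additive constant Const is omitted since it does not depend on x_i, P or Q, and the numeric values are illustrative only.

```python
import numpy as np

def log_lik_row(x_i, p_i, Q, sigma):
    """log p(x_i | P, Q) for normally distributed X, per Formula 3, up to Const."""
    resid = x_i - p_i @ Q                    # row of residuals x_ij - p_i . q_j
    return -float(resid @ resid) / (2.0 * sigma ** 2)

# Toy values: d = 2 data items, k = 2 latent dimensions
x_i = np.array([0.5, 0.5])                   # ith row of X
p_i = np.array([1.0, 0.0])                   # ith row of P
Q = np.array([[0.5, 0.5],
              [0.0, 1.0]])                   # k x d matrix Q
val = log_lik_row(x_i, p_i, Q, sigma=1.0)    # here p_i @ Q == x_i, so val == 0
```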
It can be seen that, by applying the value ranges of the elements in x_i, P and Q to Formula 3, the range of possible values of −(x_{i,j} − p_i·q_j)² can be determined, thereby obtaining the value range of log p(x_i | P, Q). For an X that follows a normal distribution, the value range of log p(x_i | P, Q) is determined by the value ranges of the matrices X, P and Q, together with the mean and variance of the normal distribution that X follows.
When the matrix X has a distribution function of another form, such as a Poisson distribution, a discrete distribution or a Beta distribution, the value range of log p(x_i | P, Q) can likewise be obtained from the expression of the corresponding distribution and the value ranges and distribution parameters of the elements in X, P and Q; this will not be described in detail.
In the embodiments of the present specification, a value not less than the maximum of |log p(x_i | P, Q)| may be taken as the supremum B. In other words, the supremum B satisfies Formula 4: for all possible x_i, P and Q,
B ≥ max |log p(x_i | P, Q)|    (Formula 4)
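A supremum B satisfying Formula 4 for the normal-distribution case of Formula 3 can be bounded from the value ranges alone. The sketch below is illustrative (the function name and ranges are assumptions, and Const is again omitted): it takes the largest possible |p_i·q_j| implied by the element ranges and the worst-case squared deviation per data item.

```python
def supremum_B(d, x_rng, p_rng, q_rng, k, sigma):
    """Return a B >= max |log p(x_i|P,Q)| (up to Const) for normally distributed X.
    x_rng, p_rng, q_rng are (min, max) element ranges; k is the inner dimension."""
    # largest possible |p_i . q_j| given element-wise ranges (k terms in the dot product)
    pq_max = k * max(abs(p_rng[0] * q_rng[0]), abs(p_rng[0] * q_rng[1]),
                     abs(p_rng[1] * q_rng[0]), abs(p_rng[1] * q_rng[1]))
    # largest possible |x_ij - p_i . q_j|
    dev = max(abs(x_rng[1] + pq_max), abs(x_rng[0] - pq_max))
    # each row x_i contributes d squared-deviation terms, each at most dev^2
    return d * dev ** 2 / (2.0 * sigma ** 2)

# Example: x_ij in [0,1], elements of P and Q in [-1,1], k = 2, sigma = 1
B = supremum_B(d=4, x_rng=(0.0, 1.0), p_rng=(-1.0, 1.0), q_rng=(-1.0, 1.0), k=2, sigma=1.0)
```

Any looser value than this bound also satisfies Formula 4; a tighter B yields less noisy sampling in Formula 5.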
Step 120, sampling from the posterior distribution proportional to exp((ε/4B) · Σ_{i=1..n} log p(x_i | P, Q)) · π(P, Q); the P and Q obtained by sampling are output data satisfying ε-differential privacy, and π(P, Q) is a prior distribution over P and Q.
Suppose that one record x_k in the matrix X is replaced by x′_k, yielding a matrix X′, and let θ = (P, Q). Consider sampling from the ordinary posterior distribution of (P, Q), p(θ | X) ∝ exp(Σ_{i=1..n} log p(x_i | θ)) · π(θ). Since |log p(x_i | θ)| ≤ B, replacing x_k with x′_k changes the exponent by at most 2B and the normalizing constant by at most a further factor of exp(2B), so p(θ | X) ≤ exp(4B) · p(θ | X′). It can thus be seen that statistics sampled from this posterior distribution satisfy 4B-differential privacy. Then, for any differential privacy protection parameter ε, sampling is performed from the posterior distribution proportional to exp((ε/4B) · Σ_{i=1..n} log p(x_i | θ)) · π(θ), as shown in Formula 5:
p(P, Q | X) ∝ exp((ε/4B) · Σ_{i=1..n} log p(x_i | P, Q)) · π(P, Q)    (Formula 5)
Any P and Q obtained by sampling according to Formula 5 satisfy ε-differential privacy and can be used as the output of the differential-privacy matrix decomposition.
In the embodiments of the present specification, any sampling method may be used; this is not limited. For example, Markov chain Monte Carlo (MCMC) or stochastic gradient Hamiltonian Monte Carlo (SGHMC) may be used as the sampling method.
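As one concrete instance, a Metropolis-Hastings sampler targeting the density of Formula 5 can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: it assumes the Gaussian model of Formula 3, a flat prior π on the preset element range [−1, 1], and toy step counts; the clipping of proposals to the range is a simplification.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_lik_sum(X, P, Q, sigma):
    # sum_i log p(x_i | P, Q) per Formula 3; Const is dropped because
    # constants cancel in the Metropolis acceptance ratio
    R = X - P @ Q
    return -np.sum(R * R) / (2.0 * sigma ** 2)

def mh_sample(X, k, eps, B, sigma=1.0, steps=200, step_size=0.05):
    """Metropolis-Hastings sketch targeting Formula 5:
    density proportional to exp((eps/(4B)) * sum_i log p(x_i|P,Q)) * pi(P,Q)."""
    n, d = X.shape
    P = rng.uniform(-1, 1, (n, k))
    Q = rng.uniform(-1, 1, (k, d))
    log_target = (eps / (4 * B)) * log_lik_sum(X, P, Q, sigma)
    for _ in range(steps):
        # random-walk proposal, kept inside the preset value range [-1, 1]
        P2 = np.clip(P + step_size * rng.standard_normal(P.shape), -1, 1)
        Q2 = np.clip(Q + step_size * rng.standard_normal(Q.shape), -1, 1)
        lt2 = (eps / (4 * B)) * log_lik_sum(X, P2, Q2, sigma)
        if np.log(rng.random()) < lt2 - log_target:   # accept / reject
            P, Q, log_target = P2, Q2, lt2
    return P, Q

X = rng.uniform(0, 1, (6, 4))          # toy normalized sensitive matrix
P, Q = mh_sample(X, k=2, eps=1.0, B=18.0)
```

In practice SGHMC scales better to large n, since it uses mini-batch gradients rather than full-data acceptance tests.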
Matrices P and Q that satisfy the ε-differential privacy requirement can be used as data for which privacy protection has been accomplished. For example, the n × k matrix P may be provided to a data mining party as a data source, or part of a data source, for data mining, so that the information carried by the sensitive data can be used as the input of a machine learning model to train a new model while the sensitive data itself is protected.
In the embodiment of the present specification, a flow of a data mining method for differential privacy protection is shown in fig. 2.
In step 210, a matrix P of dimensions n × k is obtained. The matrix P is obtained by sampling from the posterior distribution proportional to exp((ε/4B) · Σ_{i=1..n} log p(x_i | P, Q)) · π(P, Q); ε is a differential privacy protection parameter; X is an n × d matrix decomposable into the product of the matrix P and a k × d matrix Q; x_i is the ith row of the matrix X; p(x_i | P, Q) is the likelihood function of x_i given P and Q; π(P, Q) is a prior distribution over P and Q; B is determined according to the value ranges of X, P and Q and is a supremum not less than the maximum value of |log p(x_i | P, Q)|.
By using the differential privacy protection method for sensitive data in the embodiment of the present specification, a party having sensitive data decomposes a sensitive data matrix X into matrices P and Q, and then provides the matrix P with dimensions n × k as a data source after differential privacy protection to a party performing data mining. The party performing data mining may obtain the matrix P from the party owning sensitive data in any way, and the embodiments of the present specification are not limited.
In step 220, the data mining model is trained using the obtained training samples.
After obtaining the matrix P, the party performing data mining may train the data mining model using the matrix P as a set of n training samples (i.e., using each row of P as a data record); or the matrix P may be used as a partial data source and, after data fusion with other data sources, the generated training samples may be used to train the data mining model.
The specific data fusion mode can be determined according to factors such as the characteristics of the data in the actual application scenario and the type of the data mining model, and is not limited. For example, assuming that the matrix P contains the differential-privacy-protected data of n users, a trusted data mining party may splice t (t is a natural number) non-sensitive data items of the same n users with the matrix P to form n training samples of dimension (k + t).
In the embodiments of the present specification, the type of the data mining model and the specific training method are not limited.
It can be seen that, in the embodiments of the present specification, the sensitive data matrix X is decomposed into a matrix P and a matrix Q having certain value ranges, a value not less than the maximum of |log p(x_i | P, Q)| is taken as the supremum B, and, using the property that statistics obtained by sampling from the (P, Q) posterior distribution satisfy 4B-differential privacy, P and Q are sampled from the posterior distribution proportional to exp((ε/4B) · Σ_{i=1..n} log p(x_i | P, Q)) · π(P, Q). A matrix decomposition result satisfying ε-differential privacy can thus be obtained without matrix inversion, a decomposition of large-scale or online data can be obtained quickly and efficiently, and no requirement is imposed on the distribution function of X.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
In one application example of this specification, a data provider entrusts a trusted data miner to mine user data and provides the data miner with part of the data source required for mining. The data provider is the party holding the original data containing sensitive user information, while the data miner holds non-sensitive information about the same user group. Before providing data to the data miner, the data provider needs to apply differential privacy protection to the original data. The specific data processing procedure is shown in fig. 3.
At the data provider, the d sensitive data items of n users are formed into an n × d original data matrix, and the matrix X is obtained by normalizing this matrix by columns. The matrix X follows a normal distribution with mean μ and standard deviation σ.
At the data provider, the value range of log p(x_i | P, Q) is determined according to Formula 3, and a B satisfying Formula 4 is taken as the supremum. Then, according to Formula 5, P and Q satisfying ε-differential privacy are obtained by sampling from the posterior distribution proportional to exp((ε/4B) · Σ_{i=1..n} log p(x_i | P, Q)) · π(P, Q).
The data provider provides the matrix P to the data miner.
At the data miner, t non-sensitive data items of the same n users are formed into an n × t non-sensitive data matrix. After obtaining the matrix P, the data miner performs data fusion between the non-sensitive data matrix and the matrix P according to the user to which each row p_i of P belongs, generating an n × (k + t)-dimensional matrix Y, where each row of Y comprises the t non-sensitive data items of one user and that user's k data items in the matrix P.
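With the rows already aligned by user, the data fusion step reduces to a column-wise splice; the sketch below uses placeholder matrices (all sizes and values are illustrative assumptions).

```python
import numpy as np

n, k, t = 5, 2, 3
P = np.ones((n, k))              # stand-in for the sampled privacy-protected matrix
nonsensitive = np.zeros((n, t))  # stand-in for the miner's own n x t data, same row order

# n x (k + t) training matrix Y: each row holds one user's t non-sensitive
# items followed by that user's k items from P
Y = np.hstack([nonsensitive, P])
```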
At the data miner, the data mining model is trained using the matrix Y as the training samples, yielding the resulting model.
Corresponding to the implementations of the above flows, embodiments of this specification further provide a differential privacy protection apparatus for sensitive data and a data mining apparatus with differential privacy protection. Both apparatuses can be implemented by software, by hardware, or by a combination of the two. Taking a software implementation as an example, the logical apparatus is formed by the Central Processing Unit (CPU) of the device reading the corresponding computer program instructions into memory and running them. In terms of hardware, in addition to the CPU, memory and storage shown in fig. 4, the device in which the apparatus is located generally also includes other hardware such as a chip for transmitting and receiving wireless signals and/or a board for implementing network communication functions.
Fig. 5 illustrates a differential privacy protection apparatus for sensitive data according to an embodiment of the present specification, where the sensitive data is a matrix X of dimensions n × d, X is decomposable into the product of a matrix P of dimensions n × k and a matrix Q of dimensions k × d, and n, k and d are natural numbers. The apparatus includes a supremum determining unit and a posterior sampling unit, where: the supremum determining unit is used for determining, according to the value ranges of X, P and Q, a supremum B not less than the maximum value of |log p(x_i | P, Q)|; x_i is the ith row of the matrix X, i is a natural number from 1 to n, and p(x_i | P, Q) is the likelihood function of x_i given P and Q; the posterior sampling unit is used for sampling from the posterior distribution proportional to exp((ε/4B) · Σ_{i=1..n} log p(x_i | P, Q)) · π(P, Q), where the P and Q obtained by sampling are output data satisfying ε-differential privacy, and π(P, Q) is a prior distribution over P and Q.
Optionally, the apparatus further includes a sampling data output unit, configured to provide the matrix P to a data mining party, so that the data mining party performs data mining on the matrix P as at least part of the data source.
Optionally, the apparatus further includes a normalization processing unit, configured to perform normalization processing on the matrix X by columns before determining the supremum boundary B.
Optionally, X follows a normal distribution, and the value range of log p(x_i | P, Q) is determined according to the value ranges of X, P and Q, and the mean and variance of the normal distribution.
Optionally, the posterior sampling unit is specifically configured to sample from the posterior distribution proportional to exp((ε/4B) · Σ_{i=1..n} log p(x_i | P, Q)) · π(P, Q) using a Markov chain Monte Carlo sampling method or a stochastic gradient Hamiltonian Monte Carlo sampling method.
Fig. 6 illustrates a data mining apparatus with differential privacy protection according to an embodiment of the present specification. The apparatus includes a protected data acquisition unit and a model training unit, where: the protected data acquisition unit is used for acquiring a matrix P of dimensions n × k; the matrix P is obtained by sampling from the posterior distribution proportional to exp((ε/4B) · Σ_{i=1..n} log p(x_i | P, Q)) · π(P, Q); X is an n × d matrix decomposable into the product of the matrix P and a k × d matrix Q; x_i is the ith row of the matrix X; p(x_i | P, Q) is the likelihood function of x_i given P and Q; π(P, Q) is a prior distribution over P and Q; B is determined according to the value ranges of X, P and Q and is a supremum not less than the maximum value of |log p(x_i | P, Q)|; n, k and d are natural numbers, i is a natural number from 1 to n, and ε is a differential privacy protection parameter; the model training unit is used for generating training samples using the matrix P and training the data mining model.
Optionally, the model training unit is specifically configured to: and taking the matrix P as a partial data source, performing data fusion with other data sources to generate a training sample, and training the data mining model.
Embodiments of the present description provide a computer device that includes a memory and a processor. Wherein the memory has stored thereon a computer program executable by the processor; the processor, when executing the stored computer program, performs the steps of the differential privacy protection method for sensitive data in the embodiments of the present specification. For a detailed description of the steps of the differential privacy protection method for sensitive data, reference is made to the preceding contents, which are not repeated.
Embodiments of the present description provide a computer device that includes a memory and a processor. Wherein the memory has stored thereon a computer program executable by the processor; the processor, when executing the stored computer program, performs the steps of the differential privacy preserving data mining method of the embodiments of the present specification. For a detailed description of the steps of the data mining method for differential privacy protection, refer to the previous contents and are not repeated.
Embodiments of the present description provide a computer-readable storage medium having stored thereon computer programs which, when executed by a processor, perform the steps of the differential privacy protection method for sensitive data in embodiments of the present description. For a detailed description of the steps of the differential privacy protection method for sensitive data, reference is made to the preceding contents, which are not repeated.
Embodiments of the present description provide a computer-readable storage medium having stored thereon computer programs which, when executed by a processor, perform the steps of the differential privacy preserving data mining method of embodiments of the present description. For a detailed description of the steps of the data mining method for differential privacy protection, refer to the previous contents and are not repeated.
The above description is only exemplary of the present invention and should not be taken as limiting the scope of the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.