CN109409117B - Differential privacy protection method and device for sensitive data - Google Patents

Differential privacy protection method and device for sensitive data

Info

Publication number
CN109409117B
CN109409117B (application CN201710697388.9A)
Authority
CN
China
Prior art keywords
matrix
data
sampling
dimensions
data mining
Prior art date
Legal status
Active
Application number
CN201710697388.9A
Other languages
Chinese (zh)
Other versions
CN109409117A (en)
Inventor
刘子奇
周俊
李小龙
Current Assignee
Advanced New Technologies Co Ltd
Original Assignee
Advanced New Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Advanced New Technologies Co Ltd filed Critical Advanced New Technologies Co Ltd
Priority to CN201710697388.9A priority Critical patent/CN109409117B/en
Publication of CN109409117A publication Critical patent/CN109409117A/en
Application granted granted Critical
Publication of CN109409117B publication Critical patent/CN109409117B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes

Abstract

The present specification provides a differential privacy protection method for sensitive data, where the sensitive data is a matrix X of dimensions n × d, X is decomposed into a product of a matrix P of dimensions n × k and a matrix Q of dimensions k × d, and n, k and d are natural numbers. The method includes: according to the value ranges of X, P and Q, determining a supremum B that is not less than the maximum value of |log Pr(x_i | P, Q)|, where x_i is the ith row of the matrix X, i is a natural number from 1 to n, and Pr(x_i | P, Q) is the likelihood function of x_i given P and Q; and sampling from the posterior distribution proportional to exp((ε/4B)·Σ_{i=1}^{n} log Pr(x_i | P, Q))·π(P, Q), where the P and Q obtained by sampling are output data satisfying ε-differential privacy and π(P, Q) is the prior distribution over P and Q.

Description

Differential privacy protection method and device for sensitive data
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a differential privacy protection method and apparatus for sensitive data, and a data mining method and apparatus for differential privacy protection.
Background
With the development and popularization of the internet, various activities performed on the basis of the network generate data continuously, and a lot of enterprises, governments, even individuals and the like master a lot of user data. The data mining technology can find valuable knowledge, modes, rules and other information from a large amount of data, provides auxiliary support for scientific research, business decision, process control and the like, and becomes an important mode for data utilization.
In some application scenarios, the data used for mining contains a lot of sensitive information, such as data of the financial industry, data of government departments, and the like. How to protect the privacy of the sensitive information in the data mining process becomes an increasingly concerned problem.
Differential Privacy (DP) is a model that formally quantifies the risk of leakage of sensitive information. Given a positive real number ε > 0, a randomized algorithm A with input domain D satisfies ε-differential privacy if, for any two input data sets X, Y ∈ D that differ in only one record, and for any set S of possible outputs of A, formula 1 holds:

P[A(X) ∈ S] ≤ e^ε · P[A(Y) ∈ S]    (formula 1)

In formula 1, P(·) denotes the probability of the corresponding output being produced, i.e., of the information being revealed; ε is the differential privacy protection parameter and represents the strength of the protection capability: the larger ε is, the weaker the protection, and the smaller ε is, the stronger the protection.
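As a concrete reading of formula 1 (an illustrative numerical example, not part of the original text): with ε = ln 2 ≈ 0.69, formula 1 states that P[A(X) ∈ S] ≤ 2·P[A(Y) ∈ S] for every output set S, so adding or removing a single record changes the probability of any observable outcome by at most a factor of 2; with ε = 0.1 that factor shrinks to e^0.1 ≈ 1.105.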
Differential privacy therefore limits the influence that any single record can have on the output of the algorithm A, so that the information about a record that can be obtained by analyzing the output of A is almost the same as the information that could be obtained from an input data set without that record. When differential privacy technology is applied to practical scenarios, the difficulty lies in designing efficient algorithms capable of processing large-scale data.
Disclosure of Invention
In view of the above, the present specification provides a differential privacy protection method for sensitive data, where the sensitive data is a matrix X of dimensions n × d, X is decomposable into the product of a matrix P of dimensions n × k and a matrix Q of dimensions k × d, and n, k and d are natural numbers, the method including:
according to the value ranges of X, P and Q, determining a supremum B that is not less than the maximum value of |log Pr(x_i | P, Q)|; x_i is the ith row of the matrix X, i is a natural number from 1 to n; Pr(x_i | P, Q) is the likelihood function of x_i given P and Q;
sampling from the posterior distribution proportional to exp((ε/4B)·Σ_{i=1}^{n} log Pr(x_i | P, Q))·π(P, Q), where the P and Q obtained by sampling are output data satisfying ε-differential privacy, and π(P, Q) is the prior distribution over P and Q.
The data mining method for differential privacy protection provided by this specification includes the following steps:
acquiring a matrix P of dimensions n × k; the matrix P is obtained by sampling from the posterior distribution proportional to exp((ε/4B)·Σ_{i=1}^{n} log Pr(x_i | P, Q))·π(P, Q); X is a matrix of dimensions n × d that can be decomposed into the product of the matrix P and a matrix Q of dimensions k × d; x_i is the ith row of the matrix X; Pr(x_i | P, Q) is the likelihood function of x_i given P and Q; π(P, Q) is the prior distribution over P and Q; B is determined according to the value ranges of X, P and Q and is not less than the maximum value of |log Pr(x_i | P, Q)|; n, k and d are natural numbers, i is a natural number from 1 to n, and ε is the differential privacy protection parameter;
generating training samples using the matrix P, and training a data mining model.
The present specification also provides a differential privacy protection apparatus for sensitive data, where the sensitive data is a matrix X of dimensions n × d, X is decomposable into the product of a matrix P of dimensions n × k and a matrix Q of dimensions k × d, and n, k and d are natural numbers, the apparatus including:
a supremum determining unit, configured to determine, according to the value ranges of X, P and Q, a supremum B that is not less than the maximum value of |log Pr(x_i | P, Q)|; x_i is the ith row of the matrix X, i is a natural number from 1 to n; Pr(x_i | P, Q) is the likelihood function of x_i given P and Q;
a posterior sampling unit, configured to sample from the posterior distribution proportional to exp((ε/4B)·Σ_{i=1}^{n} log Pr(x_i | P, Q))·π(P, Q), where the P and Q obtained by sampling are output data satisfying ε-differential privacy, and π(P, Q) is the prior distribution over P and Q.
The present specification provides a data mining apparatus with differential privacy protection, including:
a protected data acquisition unit, configured to acquire a matrix P of dimensions n × k; the matrix P is obtained by sampling from the posterior distribution proportional to exp((ε/4B)·Σ_{i=1}^{n} log Pr(x_i | P, Q))·π(P, Q); X is a matrix of dimensions n × d that can be decomposed into the product of the matrix P and a matrix Q of dimensions k × d; x_i is the ith row of the matrix X; Pr(x_i | P, Q) is the likelihood function of x_i given P and Q; π(P, Q) is the prior distribution over P and Q; B is determined according to the value ranges of X, P and Q and is not less than the maximum value of |log Pr(x_i | P, Q)|; n, k and d are natural numbers, i is a natural number from 1 to n, and ε is the differential privacy protection parameter;
a model training unit, configured to generate training samples using the matrix P and train a data mining model.
This specification provides a computer device comprising: a memory and a processor; the memory having stored thereon a computer program executable by the processor; and when the processor runs the computer program, executing the steps of the differential privacy protection method for the sensitive data.
This specification provides a computer device comprising: a memory and a processor; the memory having stored thereon a computer program executable by the processor; and when the processor runs the computer program, the steps of the data mining method for differential privacy protection are executed.
The present specification provides a computer readable storage medium, on which a computer program is stored, and the computer program is executed by a processor to perform the steps of the differential privacy protection method for sensitive data.
The present specification also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the differential privacy preserving data mining method described above.
As can be seen from the above technical solutions, in the embodiments of this specification, the sensitive data matrix X is decomposed into a matrix P and a matrix Q; a supremum B that is not less than the maximum value of |log Pr(x_i | P, Q)| is determined; and, using the fact that statistics obtained by sampling from the posterior distribution of (P, Q) satisfy 4B-differential privacy, P and Q are sampled from the posterior distribution proportional to exp((ε/4B)·Σ_{i=1}^{n} log Pr(x_i | P, Q))·π(P, Q). A matrix decomposition result satisfying ε-differential privacy can thus be obtained without matrix inversion, so large-scale private data can be protected quickly and efficiently; meanwhile, the method applies to a matrix X with an arbitrary distribution and can be widely applied in various scenarios.
Drawings
FIG. 1 is a flow chart of a method for differential privacy protection of sensitive data in an embodiment of the present description;
FIG. 2 is a flow diagram of a data mining method for differential privacy protection in an embodiment of the present description;
FIG. 3 is a schematic diagram of a privacy preserving and data mining process for sensitive data in an example of application of the present specification;
FIG. 4 is a hardware block diagram of an apparatus for carrying out embodiments of the present description;
FIG. 5 is a logic structure diagram of a differential privacy protection apparatus for sensitive data according to an embodiment of the present disclosure;
fig. 6 is a logical block diagram of a data mining device for differential privacy protection in an embodiment of the present specification.
Detailed Description
Matrix Factorization (MF), which decomposes an original matrix into the product of several matrices, is a dimension reduction technique. Matrix factorization achieves the purpose of compressing, representing, and approximating the original matrix by finding effective low-dimensional features in it. The low-rank matrices produced by matrix factorization therefore retain most of the information content of the original matrix while being different from it, which is why matrix factorization can be used to process sensitive data.
Assume the sensitive data has n (n is a natural number) records and each record has d (d is a natural number) data items; the sensitive data can then be expressed as a matrix X of dimensions n × d. Performing matrix factorization on X means finding a matrix P of dimensions n × k (k is a natural number) and a matrix Q of dimensions k × d satisfying formula 1:

X = PQ    (formula 1)

The low-rank matrix P or Q is an approximation of the sensitive data (i.e., the matrix X) and retains a large amount of the information in the sensitive data. If the low-rank matrix P or Q is to be used as privacy-preserving data, it must additionally be difficult to derive the sensitive data matrix back from the low-rank matrix; a matrix P or Q satisfying the differential privacy requirement meets this condition.
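As an illustration of what a low-rank factorization looks like, the following minimal sketch factors a matrix with a truncated SVD; the use of SVD, the dimensions, and the variable names are assumptions for illustration only — the method of this specification obtains P and Q by posterior sampling, not by SVD.

```python
import numpy as np

# Illustrative only: X (n x d) is approximated by P (n x k) @ Q (k x d).
n, d, k = 100, 20, 5
rng = np.random.default_rng(0)
X = rng.random((n, d))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
P = U[:, :k] * s[:k]   # n x k
Q = Vt[:k, :]          # k x d

print(P.shape, Q.shape)
print("relative reconstruction error:", np.linalg.norm(X - P @ Q) / np.linalg.norm(X))
```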
The embodiments of this specification provide a new differential privacy protection method for sensitive data and a new data mining method with differential privacy protection. Value ranges are set for P and Q, a value not less than the maximum value of |log Pr(x_i | P, Q)| is taken as the supremum B, and, using the fact that statistics based on the posterior distribution of (P, Q) satisfy 4B-differential privacy, P and Q satisfying ε-differential privacy are obtained by sampling from the posterior distribution of (P, Q) proportional to exp((ε/4B)·Σ_{i=1}^{n} log Pr(x_i | P, Q))·π(P, Q). The embodiments of this specification do not require matrix inversion when sampling from the posterior distribution, can obtain the decomposed matrices quickly and efficiently, and are suitable for processing large-scale data or online data; moreover, there is no requirement on the distribution function of the sensitive data.
Embodiments of the present specification may be implemented on any device with computing and storage capabilities, such as a mobile phone, a tablet computer, a PC (Personal Computer), a notebook computer, a server, and so on; the functions in the embodiments of the present specification may also be implemented by logical nodes running on two or more devices.
In the embodiment of the present specification, a flow of a differential privacy protection method for sensitive data is shown in fig. 1.
Step 110: according to the value ranges of X, P and Q, determine a supremum B that is not less than the maximum value of |log Pr(x_i | P, Q)|.
In the embodiments of this specification, the matrices X, P and Q each have a value range. That is, for the matrix X there are two real numbers max_X and min_X satisfying formula 2:

min_X ≤ x_{i,j} ≤ max_X    (formula 2)

In formula 2, x_{i,j} is the element in the ith row and jth column of the matrix X, i is a natural number from 1 to n, and j is a natural number from 1 to d.
Similarly, there are two such real numbers for matrices P and Q, respectively.
In practical application scenarios, most raw data from business systems already has a value range; even individual data items without a value range can be converted into data items with a value range through some simple processing, without affecting the meaning they represent. For example, a data item without a value range can be mapped to one with a value range through a suitable mathematical function; as another example, data items without a value range can be divided into several levels, with the specific values represented by level values, thereby defining a value range.
In one implementation, the raw data may be represented as a matrix X_p of dimensions n × d, and the matrix X obtained after normalizing X_p by columns is used as the sensitive data matrix. In this way, x_{i,j} has the value range [0, 1].
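A minimal sketch of this column-wise normalization, assuming min–max scaling per column (the exact normalization formula is not specified here, so min–max is an assumption):

```python
import numpy as np

def normalize_columns(X_p: np.ndarray) -> np.ndarray:
    """Min-max normalize each column of X_p into [0, 1]."""
    col_min = X_p.min(axis=0)
    col_max = X_p.max(axis=0)
    span = np.where(col_max > col_min, col_max - col_min, 1.0)  # avoid division by zero
    return (X_p - col_min) / span

# X = normalize_columns(X_p)   # every x_{i,j} then lies in [0, 1]
```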
For the matrices P and Q, the value ranges of their elements can be preset. Because the information contained in the matrix X is reflected in the relative values of the elements of P and Q, limiting their value ranges does not affect how closely the information carried by P and Q approximates the information in X.
Pr(x_i | P, Q) is the likelihood function of x_i given P and Q, where x_i is the ith row of the matrix X; it indicates how likely x_i is to be observed given P and Q. log Pr(x_i | P, Q) is the logarithmic form of the likelihood function.
In the embodiments of this specification, the matrix X follows some distribution, and the form of its distribution function can be determined in advance according to factors such as the requirements of the actual application scenario and the characteristics of the sensitive data; the parameters of the distribution function of X can be estimated from the sensitive data using existing techniques and are not described in detail here.
For a matrix X following any given distribution, once the form and parameters of its distribution function have been determined, log Pr(x_i | P, Q) is determined by x_i, the values of the elements of P and Q, and the form and parameters of the distribution function of X. Therefore, the value range of log Pr(x_i | P, Q), and hence the maximum value of |log Pr(x_i | P, Q)|, can be obtained from the value ranges of x_i and of the elements of P and Q together with the parameters of the distribution function of X.
The following takes X following a normal distribution as an example to describe how to determine the maximum value of |log Pr(x_i | P, Q)|. Assuming that the standard deviation of the normal distribution of X is σ, formula 3 holds:

log Pr(x_i | P, Q) = −(1/(2σ²)) · Σ_{j=1}^{d} (x_{i,j} − p_i·q_j)² + Const    (formula 3)

In formula 3, p_i is the ith row of P, q_j is the jth column of Q (i.e., the jth row of the transpose Q^T), and Const is a constant determined by the mean and standard deviation σ of the normal distribution of X that is independent of x_i and of the values of the elements of P and Q.
It can be seen that, by applying the value ranges of x_i and of the elements of P and Q to formula 3, the range of possible values of −(x_{i,j} − p_i·q_j)² can be determined, and from it the value range of log Pr(x_i | P, Q). For X following a normal distribution, the maximum value of |log Pr(x_i | P, Q)| is determined by the value ranges of the matrices X, P and Q and by the mean and variance of the normal distribution that X follows.
When the matrix X has a distribution function of another form, for example a Poisson distribution, a discrete distribution, a Beta distribution, and so on, the value range of log Pr(x_i | P, Q) can likewise be obtained from the expression of log Pr(x_i | P, Q) under the corresponding distribution, using the value ranges of the elements of X, P and Q and the distribution parameters. This is not described in detail here.
In the embodiments of this specification, any value that is not less than the maximum value of |log Pr(x_i | P, Q)| may be taken as the supremum B. In other words, the supremum B satisfies formula 4, i.e., for all possible x_i and all P and Q within their value ranges:

B ≥ max |log Pr(x_i | P, Q)|    (formula 4)
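For the normal-distribution case of formula 3, such a bound B can be computed directly from the value ranges. The sketch below is illustrative and assumption-laden: it assumes x_{i,j} ∈ [0, 1] after normalization, preset element ranges for P and Q, the Gaussian-noise model x_{i,j} ~ N(p_i·q_j, σ²), and it bounds |log Pr(x_i | P, Q)| by bounding each squared residual and the normalizing constant; the function name and parameters are hypothetical.

```python
import numpy as np

def supremum_B_normal(x_min, x_max, p_min, p_max, q_min, q_max, k, d, sigma):
    """Upper bound B on |log Pr(x_i | P, Q)| under the normal model of formula 3,
    given element-wise value ranges of X, P and Q (illustrative sketch)."""
    # Range of a single product p * q when p and q lie in the given intervals.
    corners = [p_min * q_min, p_min * q_max, p_max * q_min, p_max * q_max]
    pq_min, pq_max = k * min(corners), k * max(corners)  # dot product of k such terms
    # Largest possible |x_{i,j} - p_i . q_j|.
    max_resid = max(abs(x_min - pq_max), abs(x_max - pq_min))
    # |Const| for a product of d Gaussian densities with standard deviation sigma.
    const = abs(-0.5 * d * np.log(2.0 * np.pi * sigma ** 2))
    return const + d * max_resid ** 2 / (2.0 * sigma ** 2)

# Example: x in [0, 1] after normalization, preset ranges [-1, 1] for P and Q.
B = supremum_B_normal(0.0, 1.0, -1.0, 1.0, -1.0, 1.0, k=10, d=20, sigma=0.5)
```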
Step 120: sample from the posterior distribution proportional to exp((ε/4B)·Σ_{i=1}^{n} log Pr(x_i | P, Q))·π(P, Q); the P and Q obtained by sampling are output data satisfying ε-differential privacy; π(P, Q) is the prior distribution over P and Q.
Suppose one record x_k of the matrix X is replaced by x'_k, giving a matrix X'. Let θ = (P, Q), and consider sampling from the posterior distribution of (P, Q):

π(θ | X) ∝ exp( Σ_{i=1}^{n} log Pr(x_i | θ) ) · π(θ)

Because |log Pr(x_i | θ)| ≤ B for every possible record and every θ within the value ranges, the likelihood ratio Pr(x_k | θ) / Pr(x'_k | θ) is at most e^{2B}, and the ratio of the normalizing constants of the two posteriors is also at most e^{2B}, so that for any θ:

π(θ | X) ≤ e^{4B} · π(θ | X')

It can be seen that any statistic sampled from the posterior distribution π(θ | X) satisfies 4B-differential privacy. Then, for any differential privacy protection parameter ε, sampling is performed from the posterior distribution proportional to exp((ε/4B)·Σ_{i=1}^{n} log Pr(x_i | P, Q))·π(P, Q), as shown in formula 5:

π_ε(P, Q | X) ∝ exp( (ε/4B) · Σ_{i=1}^{n} log Pr(x_i | P, Q) ) · π(P, Q)    (formula 5)

Any P and Q obtained by sampling according to formula 5 satisfy ε-differential privacy and can be used as the output of the differentially private matrix decomposition.
In the embodiments of this specification, any sampling method may be used; this is not limited. For example, Markov chain Monte Carlo (MCMC) sampling or Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) sampling can be used.
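As one concrete way to draw from the distribution of formula 5, the sketch below uses a random-walk Metropolis sampler, one simple MCMC method (SGHMC or another sampler could equally be substituted). The Gaussian likelihood, the uniform prior over the preset ranges, and all step sizes and names are illustrative assumptions; note also that the formal ε-differential privacy guarantee assumes exact sampling from formula 5, which a finite MCMC run only approximates.

```python
import numpy as np

def sample_private_PQ(X, k, eps, B, sigma, lo=-1.0, hi=1.0, steps=5000, step=0.05, seed=0):
    """Draw one (P, Q) from a distribution proportional to
    exp((eps / (4 * B)) * sum_i log Pr(x_i | P, Q)) * pi(P, Q)    (formula 5)
    using random-walk Metropolis, with pi uniform on [lo, hi] for every element."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    P = rng.uniform(lo, hi, (n, k))
    Q = rng.uniform(lo, hi, (k, d))

    def tempered_loglik(P, Q):
        resid = X - P @ Q
        loglik = (-0.5 * np.sum(resid ** 2) / sigma ** 2
                  - 0.5 * n * d * np.log(2.0 * np.pi * sigma ** 2))
        return (eps / (4.0 * B)) * loglik

    cur = tempered_loglik(P, Q)
    for _ in range(steps):
        P_new = P + step * rng.standard_normal(P.shape)
        Q_new = Q + step * rng.standard_normal(Q.shape)
        # The uniform prior is zero outside the preset ranges: reject such proposals.
        if P_new.min() < lo or P_new.max() > hi or Q_new.min() < lo or Q_new.max() > hi:
            continue
        new = tempered_loglik(P_new, Q_new)
        if np.log(rng.random()) < new - cur:  # Metropolis accept/reject
            P, Q, cur = P_new, Q_new, new
    return P, Q

# P, Q = sample_private_PQ(X, k=10, eps=1.0, B=B, sigma=0.5)
```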
Matrices P and Q that satisfy the ε-differential privacy requirement can be used as data for which privacy protection has been completed. For example, in one example, the n × k matrix P may be provided to a data mining party as a data source, or part of a data source, for data mining, so that the sensitive data is protected while the information carried by the sensitive data is still used as the input of a machine learning model to learn a new model.
In the embodiment of the present specification, a flow of a data mining method for differential privacy protection is shown in fig. 2.
Step 210: acquire a matrix P of dimensions n × k. The matrix P is obtained by sampling from the posterior distribution proportional to exp((ε/4B)·Σ_{i=1}^{n} log Pr(x_i | P, Q))·π(P, Q); ε is the differential privacy protection parameter; X is a matrix of dimensions n × d that can be decomposed into the product of the matrix P and a matrix Q of dimensions k × d; x_i is the ith row of the matrix X; Pr(x_i | P, Q) is the likelihood function of x_i given P and Q; π(P, Q) is the prior distribution over P and Q; B is determined according to the value ranges of X, P and Q and is not less than the maximum value of |log Pr(x_i | P, Q)|.
Using the differential privacy protection method for sensitive data in the embodiments of this specification, the party holding the sensitive data decomposes the sensitive data matrix X into matrices P and Q, and then provides the n × k matrix P, as a data source after differential privacy protection, to the party performing data mining. The party performing data mining may obtain the matrix P from the party holding the sensitive data in any way; this is not limited in the embodiments of this specification.
Step 220: generate training samples using the matrix P, and train a data mining model.
After obtaining the matrix P, the party performing data mining may train the data mining model by using the matrix P as a set of training samples of size n (i.e., using each row of P as a data record); alternatively, the matrix P may be used as a partial data source, fused with other data sources to generate training samples, and the generated training samples then used to train the data mining model.
The specific data fusion manner can be determined according to factors such as the characteristics of the data in the actual application scenario and the type of the data mining model, and is not limited; for example, assuming that the matrix P contains the differential-privacy-protected data of n users, a trusted data mining party may splice t (t is a natural number) non-sensitive data items of the same n users with the matrix P to form an n × (k + t) training sample matrix, as in the sketch below.
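A minimal sketch of this fusion, assuming the rows of P and of the non-sensitive data are already aligned to the same n users; the variable names, the presence of labels y, and the generic model object are illustrative assumptions:

```python
import numpy as np

# P: n x k differentially private features received from the data provider.
# Z: n x t non-sensitive data items held by the data mining party, row-aligned with P.
def build_training_matrix(P: np.ndarray, Z: np.ndarray) -> np.ndarray:
    assert P.shape[0] == Z.shape[0], "rows must describe the same n users in the same order"
    return np.hstack([Z, P])  # n x (t + k) training samples

# Y = build_training_matrix(P, Z)
# model.fit(Y, y)            # any data mining / machine learning model with labels y
```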
In the embodiments of the present specification, the type of the data mining model and the specific training method are not limited.
It can be seen that, in the embodiments of this specification, the sensitive data matrix X is decomposed into a matrix P and a matrix Q with given value ranges; a value not less than the maximum value of |log Pr(x_i | P, Q)| is taken as the supremum B; and, using the fact that statistics obtained by sampling from the posterior distribution of (P, Q) satisfy 4B-differential privacy, P and Q are sampled from the posterior distribution of (P, Q) proportional to exp((ε/4B)·Σ_{i=1}^{n} log Pr(x_i | P, Q))·π(P, Q). A matrix decomposition result satisfying ε-differential privacy can thus be obtained without matrix inversion, the decomposed matrices of large-scale data or online data can be obtained quickly and efficiently, and no requirement is imposed on the distribution function of X.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
In one application example of this specification, a data provider entrusts a trusted data mining party with mining user data and provides the data mining party with part of the data sources required for mining. The data provider is the party that holds the original data containing user-sensitive information, and the data mining party holds non-sensitive information about the same group of users. Before the data provider provides the data to the data mining party, the original data needs to undergo differential privacy protection. The specific data processing procedure is shown in fig. 3.
At the data provider, d sensitive data items of n users are arranged into an n × d original data matrix, and the matrix X is obtained after the original data matrix is normalized by columns. The matrix X follows a normal distribution with mean μ and standard deviation σ.
At the data provider, the value range of log Pr(x_i | P, Q) is determined according to formula 3, and a B satisfying formula 4 is taken as the supremum. Then, according to formula 5, sampling is performed from the posterior distribution proportional to exp((ε/4B)·Σ_{i=1}^{n} log Pr(x_i | P, Q))·π(P, Q) to obtain P and Q satisfying ε-differential privacy.
The data provider provides the matrix P to the data miner.
At the data mining party, t non-sensitive data items of the same n users are arranged into a non-sensitive data matrix of dimensions n × t. After obtaining the matrix P, the data mining party determines the user to which each p_i belongs, performs data fusion between the non-sensitive data matrix and the matrix P, and generates an n × (k + t)-dimensional matrix Y, where each row of Y contains the t non-sensitive data items of one user and the k data items of that user in the matrix P.
At the data mining party, the data mining model is trained with the matrix Y as training samples to obtain the resulting model.
Corresponding to the above flows, the embodiments of this specification further provide a differential privacy protection apparatus for sensitive data and a data mining apparatus with differential privacy protection. Both apparatuses can be implemented by software, by hardware, or by a combination of hardware and software. Taking a software implementation as an example, the logical apparatus is formed by the Central Processing Unit (CPU) of the device reading the corresponding computer program instructions into memory and running them. In terms of hardware, in addition to the CPU, memory, and storage shown in fig. 4, the device in which the apparatus is located generally also includes other hardware such as a chip for transmitting and receiving wireless signals and/or a board for implementing a network communication function.
Fig. 5 illustrates a differential privacy protection apparatus for sensitive data according to an embodiment of this specification, where the sensitive data is a matrix X of dimensions n × d, X may be decomposed into the product of a matrix P of dimensions n × k and a matrix Q of dimensions k × d, and n, k and d are natural numbers. The apparatus includes a supremum determining unit and a posterior sampling unit, where: the supremum determining unit is configured to determine, according to the value ranges of X, P and Q, a supremum B that is not less than the maximum value of |log Pr(x_i | P, Q)|; x_i is the ith row of the matrix X, i is a natural number from 1 to n; Pr(x_i | P, Q) is the likelihood function of x_i given P and Q. The posterior sampling unit is configured to sample from the posterior distribution proportional to exp((ε/4B)·Σ_{i=1}^{n} log Pr(x_i | P, Q))·π(P, Q); the P and Q obtained by sampling are output data satisfying ε-differential privacy; π(P, Q) is the prior distribution over P and Q.
Optionally, the apparatus further includes a sampling data output unit, configured to provide the matrix P to a data mining party, so that the data mining party performs data mining using the matrix P as at least part of its data source.
Optionally, the apparatus further includes a normalization processing unit, configured to perform normalization processing on the matrix X by columns before determining the supremum boundary B.
Optionally, X follows a normal distribution; the maximum value of |log Pr(x_i | P, Q)| is determined according to the value ranges of X, P and Q and the mean and variance of the normal distribution.
Optionally, the posterior sampling unit is specifically configured to: sample from the posterior distribution proportional to exp((ε/4B)·Σ_{i=1}^{n} log Pr(x_i | P, Q))·π(P, Q) using a Markov chain Monte Carlo sampling method or a Stochastic Gradient Hamiltonian Monte Carlo sampling method.
Fig. 6 illustrates a data mining apparatus with differential privacy protection according to an embodiment of this specification. The apparatus includes a protected data acquisition unit and a model training unit, where: the protected data acquisition unit is configured to acquire a matrix P of dimensions n × k; the matrix P is obtained by sampling from the posterior distribution proportional to exp((ε/4B)·Σ_{i=1}^{n} log Pr(x_i | P, Q))·π(P, Q); X is a matrix of dimensions n × d that can be decomposed into the product of the matrix P and a matrix Q of dimensions k × d; x_i is the ith row of the matrix X; Pr(x_i | P, Q) is the likelihood function of x_i given P and Q; π(P, Q) is the prior distribution over P and Q; B is determined according to the value ranges of X, P and Q and is not less than the maximum value of |log Pr(x_i | P, Q)|; n, k and d are natural numbers, i is a natural number from 1 to n, and ε is the differential privacy protection parameter. The model training unit is configured to generate training samples using the matrix P and train a data mining model.
Optionally, the model training unit is specifically configured to: use the matrix P as a partial data source, perform data fusion with other data sources to generate training samples, and train the data mining model.
Embodiments of the present description provide a computer device that includes a memory and a processor. Wherein the memory has stored thereon a computer program executable by the processor; the processor, when executing the stored computer program, performs the steps of the differential privacy protection method for sensitive data in the embodiments of the present specification. For a detailed description of the steps of the differential privacy protection method for sensitive data, reference is made to the preceding contents, which are not repeated.
Embodiments of the present description provide a computer device that includes a memory and a processor. Wherein the memory has stored thereon a computer program executable by the processor; the processor, when executing the stored computer program, performs the steps of the differential privacy preserving data mining method of the embodiments of the present specification. For a detailed description of the steps of the data mining method for differential privacy protection, refer to the previous contents and are not repeated.
Embodiments of the present description provide a computer-readable storage medium having stored thereon computer programs which, when executed by a processor, perform the steps of the differential privacy protection method for sensitive data in embodiments of the present description. For a detailed description of the steps of the differential privacy protection method for sensitive data, reference is made to the preceding contents, which are not repeated.
Embodiments of the present description provide a computer-readable storage medium having stored thereon computer programs which, when executed by a processor, perform the steps of the differential privacy preserving data mining method of embodiments of the present description. For a detailed description of the steps of the data mining method for differential privacy protection, refer to the previous contents and are not repeated.
The above description is only exemplary of the present invention and should not be taken as limiting the scope of the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.

Claims (18)

1. A differential privacy protection method for sensitive data, the sensitive data being a matrix X of dimensions n × d, X being decomposable into the product of a matrix P of dimensions n × k and a matrix Q of dimensions k × d, n, k and d being natural numbers, the method comprising:
according to the value ranges of X, P and Q, determining a supremum B that is not less than the maximum value of |log Pr(x_i | P, Q)|; x_i is the ith row of the matrix X, i is a natural number from 1 to n; Pr(x_i | P, Q) is the likelihood function of x_i given P and Q;
sampling from the posterior distribution proportional to exp((ε/4B)·Σ_{i=1}^{n} log Pr(x_i | P, Q))·π(P, Q), the P and Q obtained by sampling being output data satisfying ε-differential privacy; π(P, Q) is the prior distribution over P and Q; ε is the differential privacy protection parameter; the sampling is performed according to the following formula:

π_ε(P, Q | X) ∝ exp( (ε/4B) · Σ_{i=1}^{n} log Pr(x_i | P, Q) ) · π(P, Q)
2. The method of claim 1, further comprising: providing the matrix P to a data mining party, the data mining party performing data mining using the matrix P as at least part of a data source.
3. The method of claim 1, further comprising: before the supremum boundary B is determined, the matrix X is normalized by columns.
4. The method of claim 1, wherein X follows a normal distribution; the maximum value of |log Pr(x_i | P, Q)| is determined according to the value ranges of X, P and Q and the mean and variance of the normal distribution.
5. The method of claim 1, wherein the sampling from the posterior distribution proportional to exp((ε/4B)·Σ_{i=1}^{n} log Pr(x_i | P, Q))·π(P, Q) comprises: sampling from the posterior distribution proportional to exp((ε/4B)·Σ_{i=1}^{n} log Pr(x_i | P, Q))·π(P, Q) using a Markov chain Monte Carlo sampling method or a Stochastic Gradient Hamiltonian Monte Carlo sampling method.
6. A differential privacy protected data mining method, comprising:
acquiring a matrix P of dimensions n × k; the matrix P is obtained by sampling from the posterior distribution proportional to exp((ε/4B)·Σ_{i=1}^{n} log Pr(x_i | P, Q))·π(P, Q); the sampling is performed according to the following formula:

π_ε(P, Q | X) ∝ exp( (ε/4B) · Σ_{i=1}^{n} log Pr(x_i | P, Q) ) · π(P, Q)

X is a matrix of dimensions n × d that can be decomposed into the product of the matrix P and a matrix Q of dimensions k × d; x_i is the ith row of the matrix X; Pr(x_i | P, Q) is the likelihood function of x_i given P and Q; π(P, Q) is the prior distribution over P and Q; B is determined according to the value ranges of X, P and Q and is not less than the maximum value of |log Pr(x_i | P, Q)|; n, k and d are natural numbers, i is a natural number from 1 to n, and ε is the differential privacy protection parameter; and
generating training samples using the matrix P, and training a data mining model.
7. The method of claim 6, wherein generating training samples using the matrix P and training the data mining model comprises: using the matrix P as a partial data source, performing data fusion with other data sources to generate training samples, and training the data mining model.
8. An apparatus for differential privacy protection of sensitive data, the sensitive data being a matrix X of dimensions n × d, X being decomposable into the product of a matrix P of dimensions n × k and a matrix Q of dimensions k × d, n, k and d being natural numbers, the apparatus comprising:
a supremum determining unit, configured to determine, according to the value ranges of X, P and Q, a supremum B that is not less than the maximum value of |log Pr(x_i | P, Q)|; x_i is the ith row of the matrix X, i is a natural number from 1 to n; Pr(x_i | P, Q) is the likelihood function of x_i given P and Q;
a posterior sampling unit, configured to sample from the posterior distribution proportional to exp((ε/4B)·Σ_{i=1}^{n} log Pr(x_i | P, Q))·π(P, Q), the P and Q obtained by sampling being output data satisfying ε-differential privacy; π(P, Q) is the prior distribution over P and Q; the sampling is performed according to the following formula:

π_ε(P, Q | X) ∝ exp( (ε/4B) · Σ_{i=1}^{n} log Pr(x_i | P, Q) ) · π(P, Q)
9. The apparatus of claim 8, further comprising: a sampling data output unit, configured to provide the matrix P to a data mining party, the data mining party performing data mining using the matrix P as at least part of a data source.
10. The apparatus of claim 8, further comprising: a normalization processing unit, configured to normalize the matrix X by columns before the supremum B is determined.
11. The apparatus of claim 8, wherein X follows a normal distribution; the maximum value of |log Pr(x_i | P, Q)| is determined according to the value ranges of X, P and Q and the mean and variance of the normal distribution.
12. The apparatus of claim 8, wherein the posterior sampling unit is specifically configured to: sample from the posterior distribution proportional to exp((ε/4B)·Σ_{i=1}^{n} log Pr(x_i | P, Q))·π(P, Q) using a Markov chain Monte Carlo sampling method or a Stochastic Gradient Hamiltonian Monte Carlo sampling method.
13. A differential privacy protected data mining apparatus, comprising:
a protected data acquisition unit, configured to acquire a matrix P of dimensions n × k; the matrix P is obtained by sampling from the posterior distribution proportional to exp((ε/4B)·Σ_{i=1}^{n} log Pr(x_i | P, Q))·π(P, Q); the sampling is performed according to the following formula:

π_ε(P, Q | X) ∝ exp( (ε/4B) · Σ_{i=1}^{n} log Pr(x_i | P, Q) ) · π(P, Q)

X is a matrix of dimensions n × d that can be decomposed into the product of the matrix P and a matrix Q of dimensions k × d; x_i is the ith row of the matrix X; Pr(x_i | P, Q) is the likelihood function of x_i given P and Q; π(P, Q) is the prior distribution over P and Q; B is determined according to the value ranges of X, P and Q and is not less than the maximum value of |log Pr(x_i | P, Q)|; n, k and d are natural numbers, i is a natural number from 1 to n, and ε is the differential privacy protection parameter; and
a model training unit, configured to generate training samples using the matrix P and train a data mining model.
14. The apparatus of claim 13, wherein the model training unit is configured to: use the matrix P as a partial data source, perform data fusion with other data sources to generate training samples, and train the data mining model.
15. A computer device, comprising: a memory and a processor; the memory having stored thereon a computer program executable by the processor; the processor, when executing the computer program, performs the method of any of claims 1 to 5.
16. A computer device, comprising: a memory and a processor; the memory having stored thereon a computer program executable by the processor; the processor, when executing the computer program, performs the method of any of claims 6 to 7.
17. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 5.
18. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 6 to 7.
CN201710697388.9A 2017-08-15 2017-08-15 Differential privacy protection method and device for sensitive data Active CN109409117B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710697388.9A CN109409117B (en) 2017-08-15 2017-08-15 Differential privacy protection method and device for sensitive data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710697388.9A CN109409117B (en) 2017-08-15 2017-08-15 Differential privacy protection method and device for sensitive data

Publications (2)

Publication Number Publication Date
CN109409117A CN109409117A (en) 2019-03-01
CN109409117B true CN109409117B (en) 2021-10-22

Family

ID=65454117

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710697388.9A Active CN109409117B (en) 2017-08-15 2017-08-15 Differential privacy protection method and device for sensitive data

Country Status (1)

Country Link
CN (1) CN109409117B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102801629A (en) * 2012-08-22 2012-11-28 电子科技大学 Traffic matrix estimation method
CN103034665A (en) * 2011-10-10 2013-04-10 阿里巴巴集团控股有限公司 Information searching method and device
CN104007431A (en) * 2014-05-29 2014-08-27 西安电子科技大学 Radar HRRP target recognition method based on dpLVSVM model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8555400B2 (en) * 2011-02-04 2013-10-08 Palo Alto Research Center Incorporated Privacy-preserving aggregation of Time-series data

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103034665A (en) * 2011-10-10 2013-04-10 阿里巴巴集团控股有限公司 Information searching method and device
CN102801629A (en) * 2012-08-22 2012-11-28 电子科技大学 Traffic matrix estimation method
CN104007431A (en) * 2014-05-29 2014-08-27 西安电子科技大学 Radar HRRP target recognition method based on dpLVSVM model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A data publishing method based on differential privacy; Ma Yuelei; Journal of Beijing Information Science and Technology University; 2016-09-18; Vol. 31, No. 3; pp. 27-30 *
A differential privacy preserving data publishing method based on clustering anonymization; Liu Xiaoqian et al.; Journal on Communications; 2016-07-19; Vol. 37, No. 5; pp. 125-129 *
A survey of differential privacy protection for frequent pattern mining; Ding Liping et al.; Journal on Communications; 2014-12-01; Vol. 35, No. 10; pp. 200-209 *

Also Published As

Publication number Publication date
CN109409117A (en) 2019-03-01

Similar Documents

Publication Publication Date Title
CN109308418B (en) Model training method and device based on shared data
CN107292186B (en) Model training method and device based on random forest
EP3644231A1 (en) Training sample generation method and device based on privacy protection
US20200065710A1 (en) Normalizing text attributes for machine learning models
CN111401700A (en) Data analysis method, device, computer system and readable storage medium
US20220237323A1 (en) Compatible anonymization of data sets of different sources
US20180137149A1 (en) De-identification data generation apparatus, method, and non-transitory computer readable storage medium thereof
CN109409117B (en) Differential privacy protection method and device for sensitive data
JP2018055057A (en) Data disturbing device, method and program
Das Practical AI for cybersecurity
CN115544569A (en) Privacy XGboost method applied to financial scene
Toulias et al. Generalizations of entropy and information measures
CN112529767B (en) Image data processing method, device, computer equipment and storage medium
Kumar et al. Variational optimization of informational privacy
CN113221717A (en) Model construction method, device and equipment based on privacy protection
CN111767474A (en) Method and equipment for constructing user portrait based on user operation behaviors
CN111581068A (en) Terminal workload calculation method and device, storage medium, terminal and cloud service system
Xu et al. Handling missing extremes in tail estimation
Pereira et al. Automatic Delta-Adjustment Method Applied to Missing Not At Random Imputation
Adkinson Orellana et al. A new approach for dynamic and risk-based data anonymization
CN109598393A (en) A kind of analysis method and device of the influence information that event generates enterprise
CN110334342B (en) Word importance analysis method and device
McGrory et al. Weighted Gibbs sampling for mixture modelling of massive datasets via coresets
CN113822309B (en) User classification method, apparatus and non-volatile computer readable storage medium
US11451375B2 (en) System, method and apparatus for privacy preserving inference

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20191210

Address after: P.O. Box 31119, Grand Pavilion, Hibiscus Way, 802 West Bay Road, Grand Cayman, Cayman Islands

Applicant after: Advanced New Technologies Co., Ltd.

Address before: Fourth Floor, One Capital Place, P.O. Box 847, Grand Cayman, Cayman Islands

Applicant before: Alibaba Group Holding Limited

REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40004770

Country of ref document: HK

GR01 Patent grant
GR01 Patent grant